import polars as pl
import altair as alt
import numpy as np
from great_tables import GT, style, loc, md
import gc

This is a Python implementation of Bruce G.S. Hardie’s An Excel-based Introduction to Analysing Buyer Behaviour Using Consumer Panel Data, using Polars.
Several leading market research firms (e.g., Kantar, IRI, Nielsen) collect so-called consumer panel data and sell reports and analyses based on these data to interested parties. A large number of households are recruited to join the panel, and they record all their grocery/HBA purchases. The idea of a panel is that we have repeated observations of the same people. The ability to track what individual households are purchasing over time can give important insights into what behavioural changes lie beneath observed changes in aggregate sales data. (The terms panellist and household are used interchangeably.) Nowadays, it is common for panellists to record their purchasing by scanning product barcodes via an app on their smartphones.
The objective of this notebook is to provide an introduction to basic analyses we can undertake using panel data. After briefly describing the data we will be working with, we present some preliminary aggregate-level analyses. Next we introduce some basic brand performance measures and consider the simple analyses that describe the variation we observe in buyer behaviour in a given time period. This analysis focuses on one brand at a time; next we consider some basic analyses that describe consumers’ buying of multiple brands in a category. We finish with some basic analyses that describe how buyer behaviour evolves over time, both for established products and new products.
Note: Whereas consumer panel data gives us information on the purchasing of a sample of customers for the whole category, the data in a firm’s transaction database gives us complete information on the purchasing of our products (but not those of our competitors). At a fundamental level, the types of reports developed by market research firms over the past 60+ years are a good starting point for the types of reports a firm should create as it seeks to understand the buying behaviour of its customers.
A traditional consumer panel works in the following manner:
When an individual first joins the panel, they complete a detailed questionnaire. A section of this questionnaire focuses on the demographics of their household. (This is typically updated once a year.)
After each shopping trip, each panellist records their purchases, scanning the barcode associated with each product and recording other information such as where the purchase was made (store or channel), the price paid, and the use of promotional deals. Twenty years ago, this would have been done using a custom handheld barcode scanner provided by the market research firm. These days, it is more common to use a smartphone app.
These data are uploaded to the research firm’s servers and merged with the purchases records of the other panellists. Each barcode is matched with detailed product information (e.g., category, brand, size, flavour) and this information is also stored in the database. The analyst can then create (typically product-category-specific) datasets for further analysis that tell us what each panellist purchased, when and where, and associated transaction- and/or product-specific data that may be of interest.
Panellists drop out of the panel all the time, and the research firm will be recruiting replacement households on a regular basis. When creating a dataset for further analysis, it is generally desirable to work with a so-called static sample of panellists, which comprises all those panellists active during the time period of interest; new panellists, as well as those that dropped out during the given time period, are excluded.
We will make use of two datasets as we explore the basic types of summaries of buyer behaviour that can be created using consumer panel data. The first contains data on the purchasing of established brands in a mature product category, while the second contains data on the purchasing of a new product. Both datasets were created using static samples. While these are small datasets and contain a subset of the information available in the research firm’s databases, they are more than sufficient to convey the logic of creating the key summaries of buyer behaviour. Neither dataset includes data on the demographics of each panellist. As such, we will not consider how to create reports that explore how behaviours differ across demographic groups (e.g., by age or geography). However, anyone comfortable with the analyses undertaken in this note should be able to work out how to create such reports for themselves.
The file edible_grocery.csv contains two years of data from a panel of 5021 households on their purchasing in an unnamed edible grocery product category. (We intentionally do not identify the category and the associated brand names.) There are 119 SKUs in this category. 91 SKUs are associated with the four largest brands in the category, which we have named Alpha, Bravo, Charlie, and Delta. The remaining SKUs belong to very low-share brands and we grouped them under the brand Other.
Note: SKU - A stock-keeping unit (SKU) is a unique combination of the attributes (e.g., brand, package type, package size, flavour) that define the products in the category.
Each record in this file consists of seven fields:
panel_id A unique identifier for each household.
trans_id A unique identifier for the purchase occasion.
week The week number in which the purchase occurred. Week 1 corresponds to the calendar week starting on January 1, 20xy. Week 53 corresponds to the calendar week starting on December 31, 20xy.
sku_id The SKU code.
units The number of units purchased on the particular purchase occasion.
price The price per unit paid at the point of purchase.
brand The brand associated with the SKU purchased.
The associated file sku_weight.csv gives us the weight (in grams) of each SKU. There are two fields: sku_id and weight.
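Since the week number is the only time index in these files, it can be useful to map it back to a calendar date. Here is a minimal sketch, assuming week 1 starts on 1 January of a non-leap year (the actual year is masked as 20xy, so the year used below is an assumption):

```python
from datetime import date, timedelta

def week_start(week: int, year: int) -> date:
    """Start date of a panel week, assuming week 1 begins on 1 January."""
    return date(year, 1, 1) + timedelta(weeks=week - 1)

# Week 1 starts on January 1; week 53 starts on December 31 of a non-leap year,
# matching the field description above.
print(week_start(1, 2001))   # 2001-01-01
print(week_start(53, 2001))  # 2001-12-31
```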
grocery_lf = pl.scan_csv(
source="data/panel-datasets/edible_grocery.csv",
has_header=True,
separator=",",
schema={
'panel_id': pl.UInt32,
'trans_id': pl.Int32,
'week': pl.UInt16,
'sku_id': pl.UInt8,
'units': pl.Int16,
'price': pl.Float32,
'brand': pl.Categorical})
grocery_lf.head().collect()

| panel_id | trans_id | week | sku_id | units | price | brand |
|---|---|---|---|---|---|---|
| u32 | i32 | u16 | u8 | i16 | f32 | cat |
| 3102011 | 1569 | 6 | 19 | 1 | 2.79 | "Alpha" |
| 3102012 | 4301 | 15 | 15 | 1 | 3.19 | "Alpha" |
| 3102012 | 4301 | 15 | 15 | 1 | 3.19 | "Alpha" |
| 3102012 | 4301 | 15 | 38 | 1 | 3.49 | "Bravo" |
| 3102012 | 4301 | 15 | 44 | 1 | 3.49 | "Bravo" |
sku_lf = pl.scan_csv(
source="data/panel-datasets/sku_weight.csv",
has_header=True,
separator=",",
schema={
'sku_id': pl.UInt8,
'weight': pl.Int16})
sku_lf.head().collect()

| sku_id | weight |
|---|---|
| u8 | i16 |
| 1 | 400 |
| 2 | 400 |
| 3 | 400 |
| 4 | 250 |
| 5 | 1000 |
panel_size = 5021

“Kiwi Bubbles” is a masked name for a shelf-stable juice drink, aimed primarily at children, which is sold as a multipack with several single-serve containers bundled together. Prior to national launch, it underwent a year-long test conducted in two of IRI’s BehaviorScan markets. The file kiwibubbles_trans.csv contains purchasing data for the new product, drawn from 1300 panellists in Market 1 and 1499 panellists in Market 2. (The purchasing of other brands in the category has been excluded from the dataset.)
Each record in this file consists of five fields:
ID A unique identifier for each household.
Market 1 or 2.
Week The week in which the purchase occurred.
Day The day of the week in which the purchase occurred. (The product was launched on day 1 of week 1.)
Units The number of units of the new product purchased on the particular purchase occasion.
kiwi_lf = pl.scan_csv(
source="data/panel-datasets/kiwibubbles_trans.csv",
has_header=True,
separator=",",
schema={
'ID': pl.UInt16,
'Market': pl.UInt8,
'Week': pl.Int16,
'Day': pl.Int16,
'Units': pl.Int16})
kiwi_lf.head().collect()

| ID | Market | Week | Day | Units |
|---|---|---|---|---|
| u16 | u8 | i16 | i16 | i16 |
| 10001 | 1 | 19 | 3 | 1 |
| 10002 | 1 | 12 | 5 | 1 |
| 10003 | 1 | 37 | 7 | 1 |
| 10004 | 1 | 30 | 6 | 1 |
| 10004 | 1 | 47 | 3 | 1 |
# Weekly Grocery Sales LazyFrame (Query Plan): Weekly 'spend' by 'Category', 'Brand' or 'All'
def weekly_spend_summary(brand, lf):
summary = (
lf
.select(['week', 'units', 'price', 'brand'])
.with_columns(((pl.col('units') * pl.col('price'))).alias('spend'))
)
if brand == 'Category': # Return LazyFrame of total category
summary = summary.group_by('week')
elif brand == 'All': # Return LazyFrame of all brands
summary = summary.group_by('week', 'brand')
else: # Return LazyFrame of specified brand
summary = summary.filter(
pl.col('brand') == brand
).group_by('week', 'brand')
summary = summary.agg(
pl.col("spend").sum().alias('Weekly Spend')
).sort('week')
return summary

# Weekly Grocery Volume Sales LazyFrame (Query Plan): Weekly 'volume' by 'Brand' or 'All'
def weekly_vol_summary(brand, lf):
with pl.StringCache():
lf = (
lf
.join(
other=sku_lf,
left_on="sku_id",
right_on="sku_id"
)
.select(['week', 'units', 'brand', 'weight'])
)
if brand != 'All':
brand = [brand] if isinstance(brand, str) else brand
lf = lf.filter(
pl.col('brand').is_in(brand)
)
summary = lf.with_columns(
(((pl.col('units') * pl.col('weight'))/1000)).alias('volume')
).group_by('week', 'brand').agg(
pl.col("volume").sum().alias('Weekly Volume')
).sort('week')
return summary

# Altair Weekly Line Plot
def weekly_plot(dataframe, y, year=2, color=None, title="", y_axis_label="", pct=False, legend=False):
# Configure the color encoding only if color is provided
if color is not None:
color_encoding = alt.Color(
f'{color}:N', # N = a discrete unordered category
legend=alt.Legend(title=color) if legend else None # Add legend conditionally
)
else:
color_encoding = alt.Color() # No color encoding
chart = alt.Chart(dataframe).mark_line(strokeWidth=1).encode(
x = alt.X(
'week',
axis=alt.Axis(
values=np.arange(0, (year*52) + 1, 13), # Explicitly specify quarter-end weeks
labelExpr="datum.value", # Show only these labels
title='Week'
)
),
y = alt.Y(
f'{y}:Q', # Q = a continuous real-valued quantity
title=y_axis_label,
axis=alt.Axis(format="$,.0f") if not pct else alt.Axis(format=",.0%")
),
color = color_encoding
).properties(
width=650,
height=250,
title=title
).configure_view(
stroke=None
).configure_axisY(
# grid=False # turn off y-axis grid if required
)
return chart # alt.JupyterChart(chart)

# Annual Sales Summary LazyFrame for All Brands
def annual_sales_summary():
summary = (
weekly_spend_summary('All', grocery_lf)
.with_columns((pl.col("week") / 52).ceil().alias('year'))
.group_by(['year', 'brand'])
.agg(pl.col("Weekly Spend").sum().alias('Yearly Sales'))
).sort('year')
return summary

def freq_dist_plot(
data,
column,
bin_edges,
labels,
x_title,
y_title,
chart_title,
subtitle,
width=650,
height=250,
label_angle=0,
left_closed=True,
compute_rel_freq=True
):
"""
Creates a standardized Altair bar chart for relative frequency distribution plots.
Parameters:
- data (Polars LazyFrame or DataFrame): Input dataset.
- column (str): Column to analyze for distribution.
- bin_edges (array-like): Edges for binning.
- labels (list of str): Labels for the bins.
- x_title (str): Title for the x-axis.
- y_title (str): Title for the y-axis.
- chart_title (str): Main title for the chart.
- subtitle (str): Subtitle for the chart.
- width (int, optional): Width of the chart. Default is 650.
- height (int, optional): Height of the chart. Default is 250.
- label_angle (int, optional): Angle for x-axis labels. Default is 0.
- left_closed (bool, optional): Whether bins are left-closed. Default is True.
- compute_rel_freq (bool, optional): Whether to compute relative frequencies. Default is True.
Returns:
- alt.Chart: The generated Altair chart.
"""
# Apply binning to the data
binned_data = data.with_columns(
pl.col(column).cut(bin_edges, labels=labels, left_closed=left_closed).alias("cut")
)
# Optionally compute relative frequencies
if compute_rel_freq:
binned_data = (
binned_data
.group_by("cut")
.agg(pl.col("cut").count().alias("Frequency"))
.with_columns(
(pl.col("Frequency") / pl.col("Frequency").sum()).alias("% of Total")
)
.collect()
)
# Create the Altair chart
chart = alt.Chart(binned_data).mark_bar().encode(
x=alt.X("cut:O", axis=alt.Axis(labelAngle=label_angle, title=x_title), sort=labels),
y=alt.Y("% of Total:Q", axis=alt.Axis(format=".0%", title=y_title)),
).properties(
width=width,
height=height,
title={"text": chart_title, "subtitle": subtitle},
)
return chart

Before we start analysing household-level behaviour, let us first get a sense of the general sales patterns observed in this category.
Our initial objective is to plot weekly revenue for all the brands and the overall category, respectively, and then plot weekly (volume) market shares for target brands.
# Weekly Sales Pivot Table - Polars DataFrame
# For visualizing and inspecting only
weekly_spend_summary('All', grocery_lf).collect().pivot(
on="brand",
index="week",
values="Weekly Spend",
sort_columns=True,
).with_columns(
pl.sum_horizontal(pl.exclude('week')).alias("Total") # Row total
).sort("week")

| week | Alpha | Bravo | Charlie | Delta | Other | Total |
|---|---|---|---|---|---|---|
| u16 | f32 | f32 | f32 | f32 | f32 | f32 |
| 1 | 331.460114 | 247.780121 | 51.790005 | 17.639999 | 18.75 | 667.420227 |
| 2 | 567.249939 | 398.640137 | 45.429996 | 34.190002 | 23.23 | 1068.740112 |
| 3 | 497.96994 | 639.980103 | 51.110001 | 39.389999 | 14.550001 | 1243.000122 |
| 4 | 1389.95752 | 472.5 | 51.869999 | 85.039993 | 39.779995 | 2039.147461 |
| 5 | 358.240204 | 252.540115 | 40.209995 | 69.540001 | 45.929993 | 766.460327 |
| … | … | … | … | … | … | … |
| 100 | 692.479919 | 420.479706 | 69.599991 | 54.590004 | 47.969997 | 1285.119629 |
| 101 | 265.770081 | 264.660065 | 64.029999 | 35.049999 | 26.880001 | 656.390198 |
| 102 | 379.610168 | 456.789429 | 43.499996 | 20.4 | 43.609993 | 943.909607 |
| 103 | 532.5 | 440.999725 | 76.400002 | 40.440006 | 45.549999 | 1135.889771 |
| 104 | 394.889984 | 233.190048 | 53.18 | 68.259995 | 36.68 | 786.200012 |
weekly_plot(dataframe=weekly_spend_summary('Category', grocery_lf).collect(),
y='Weekly Spend',
title='Category - Weekly Revenue',
y_axis_label='Spend ($)',
pct=False,
legend=False)

weekly_plot(dataframe=weekly_spend_summary('Alpha', grocery_lf).collect(),
y='Weekly Spend',
title='Alpha - Weekly Revenue',
y_axis_label='Spend ($)',
pct=False,
legend=False)

weekly_plot(dataframe=weekly_spend_summary('Bravo', grocery_lf).collect(),
y='Weekly Spend',
title='Bravo - Weekly Revenue',
y_axis_label='Spend ($)',
pct=False,
legend=False)

Let us finish this preliminary analysis of the data by computing the annual sales of each brand.
annual_sales_pivot = annual_sales_summary().collect().pivot(
on='brand',
index='year',
sort_columns=True
).with_columns(
pl.sum_horizontal(pl.all().exclude('year')).alias("Total") # add totals column
)
(
GT(annual_sales_pivot, rowname_col="year")
.tab_header(title="Annual Sales ($)")
.tab_stubhead(label="Year")
.fmt_currency()
.fmt_integer(columns='year')
.data_color(
columns=['Alpha', 'Bravo', 'Charlie', 'Delta', 'Other'], # colour the brand columns only; Total exceeds the colour domain
domain=[100, 36_000],
palette=["white", "rebeccapurple"]
)
)

| Annual Sales ($) | ||||||
|---|---|---|---|---|---|---|
| Year | Alpha | Bravo | Charlie | Delta | Other | Total |
| 1 | $33,570.95 | $28,603.34 | $5,120.87 | $3,271.51 | $1,535.23 | $72,101.91 |
| 2 | $35,250.75 | $26,926.85 | $3,922.68 | $2,820.81 | $1,739.82 | $70,660.91 |
annual_change_sales = annual_sales_pivot.with_columns(
pl.col('*').exclude('year').pct_change()
).filter(
pl.col('year') == 2
).unpivot(
index="year", # Keep 'year' as a fixed identifier
value_name='% Change',
variable_name='brand'
)
(
GT(annual_change_sales, rowname_col='brand')
.tab_header(title='% Change in Annual Sales')
.tab_stubhead(label="Brands")
.fmt_percent()
.data_color(
columns=['% Change'],
domain=[-0.3, 0.3],
palette=['orange', 'white', 'rebeccapurple']
)
.cols_hide('year')
.cols_label(brand='Y-o-Y % Change')
)

| % Change in Annual Sales | |
|---|---|
| Brands | % Change |
| Alpha | 5.00% |
| Bravo | −5.86% |
| Charlie | −23.40% |
| Delta | −13.78% |
| Other | 13.33% |
| Total | −2.00% |
chart = alt.Chart(annual_sales_summary().collect()).mark_bar().encode(
x=alt.X("brand:N",
axis=alt.Axis(title='Brands', labelAngle=0)),
xOffset="year:N",
y=alt.Y("Yearly Sales:Q",
axis=alt.Axis(format="$,.0f"),
title='Yearly Revenue ($)'),
color=alt.Color("year:N", title='Year'),
).properties(
width=650,
height=250,
title='Year 1 & Year 2 Revenues'
).show()

base = alt.Chart(annual_change_sales).encode(
x=alt.X("brand:N",
title='Brand',
axis=alt.Axis(labelAngle=0)
),
y=alt.Y("% Change:Q",
axis=alt.Axis(format=".0%")
),
color=alt.condition(
alt.datum["% Change"] > 0, # colour by the sign of the change
alt.value("green"), # The positive color
alt.value("red") # The negative color
),
text=alt.Text(
'% Change',
format=(".1%")
)
).properties(
width=650,
height=250,
title='Y-o-Y % Change in Revenue'
)
chart = base.mark_bar() + base.mark_text(
align='center',
baseline=alt.expr(alt.expr.if_(alt.datum['% Change'] >= 0, 'bottom', 'top')),
dy=alt.expr(alt.expr.if_(alt.datum['% Change'] >= 0, -2, 2)),
dx=0)
chart.show()

avg_sku_price = grocery_lf.select(
['week', 'sku_id', 'price', 'brand']
).filter(
(pl.col('week') <= 52) & # pricing in the first year
(pl.col('brand') == 'Alpha') # Filter by Alpha brand
).group_by('brand', 'sku_id').agg(
pl.col('price').mean()
).drop('brand').sort(
pl.col('sku_id').cast(pl.Int8)
)
(
GT(avg_sku_price.collect(), rowname_col='sku_id')
.tab_header(title='Alpha - Average SKU Retail Prices')
.tab_stubhead(label="SKU IDs")
.fmt_currency()
.data_color(
columns=['price'],
domain=[1, 15],
palette=['white', 'rebeccapurple']
)
.cols_label(price='Average Price ($)')
)

| Alpha - Average SKU Retail Prices | |
|---|---|
| SKU IDs | Average Price ($) |
| 4 | $1.85 |
| 5 | $3.44 |
| 6 | $2.61 |
| 7 | $5.10 |
| 8 | $12.43 |
| 9 | $6.59 |
| 10 | $10.49 |
| 11 | $2.27 |
| 12 | $2.79 |
| 14 | $1.99 |
| 15 | $3.42 |
| 16 | $2.58 |
| 17 | $2.80 |
| 18 | $3.27 |
| 19 | $2.62 |
| 20 | $3.43 |
| 21 | $2.99 |
| 22 | $2.09 |
avg_sku_price.drop(pl.col('sku_id')).describe()

| statistic | price |
|---|---|
| str | f64 |
| "count" | 18.0 |
| "null_count" | 0.0 |
| "mean" | 4.042794 |
| "std" | 2.947076 |
| "min" | 1.84705 |
| "25%" | 2.582056 |
| "50%" | 2.99 |
| "75%" | 3.438095 |
| "max" | 12.43 |
Garbage Collect
exceptions = ['grocery_lf', 'sku_lf', 'kiwi_lf', 'In', 'exceptions', 'active_variables']
active_variables = [
var for var, value in globals().items()
if not var.startswith('_') # Exclude variables that start with "_"
and var not in exceptions # Exclude variables in the exceptions list
and isinstance(value, (pl.LazyFrame, pl.DataFrame, pl.Series, alt.Chart, alt.LayerChart, list, int, float, str, np.ndarray, np.int64, np.float32)) # Remove these types only
]
for var in active_variables:
del globals()[var]
del active_variables, exceptions, var
gc.collect()

668
The objective here is to explore how to perform the basic analyses that describe the variation we observe in buyer behaviour in a given time period. We will continue to work with the edible grocery dataset, exploring both purchase frequency and spend.
Before we can do any analysis, we need to create some summary datasets. The first will summarise how many times each panellist purchased each brand, as well as the category overall. The second will summarise how much each panellist spent on each brand and in the category.
grocery_lf.filter(
(pl.col('panel_id') == 3102016) &
(pl.col('trans_id') == 844)
).collect()

| panel_id | trans_id | week | sku_id | units | price | brand |
|---|---|---|---|---|---|---|
| u32 | i32 | u16 | u8 | i16 | f32 | cat |
| 3102016 | 844 | 4 | 5 | 1 | 3.49 | "Alpha" |
| 3102016 | 844 | 4 | 5 | 1 | 3.49 | "Alpha" |
| 3102016 | 844 | 4 | 5 | 1 | 3.49 | "Alpha" |
| 3102016 | 844 | 4 | 15 | 1 | 3.49 | "Alpha" |
| 3102016 | 844 | 4 | 15 | 1 | 3.49 | "Alpha" |
| 3102016 | 844 | 4 | 89 | 1 | 2.49 | "Delta" |
On this one shopping trip (trans_id = 844), panellist 3102016 purchased a total of six items: three packs of SKU 5, two packs of SKU 15, and one pack of SKU 89. They purchased two different brands, Alpha and Delta.
Why are there three lines for SKU 5 and not one line with units = 3? This is simply a function of how the items were scanned at the checkout. Some checkout operators will scan the three items separately; this results in three lines in the transaction file, each with units = 1. Others will scan the item once and press “3” on their till, resulting in one line in the transaction file with units = 3.
By convention, this purchase occasion is recorded as one category transaction, one Alpha transaction, and one Delta transaction. The number of units of Alpha purchased is five. When we say that the panellist made one category transaction, we mean they purchased at least one item in the category on that shopping trip. When we say that the panellist made one Alpha transaction, we mean they purchased at least one item associated with the brand on that shopping trip.
In order to analyse buyer behaviour in terms of transactions, we need to know the number of brand and category transactions for each person. We cannot work directly with the dataset we have been using; we effectively need to collapse the five rows associated with Alpha into one. This will require some intermediate analysis, which we undertake in the following manner.
The intermediate prep should produce a dataset that summarises transaction 844 by panellist 3102016 as one occasion of an Alpha purchase and one occasion of a Delta purchase:
Filter for year == 1, group by transaction ID, panel ID, and brand, then count the unique panel_ids per transaction and brand.
Each panellist has made some number of transactions. Group the unique transactions and obtain the count of purchase occasions on which each listed brand was purchased.
Note: this is not a sum of the total units (by brand) purchased per panellist.
Note: because there are multiple entries for the same transaction ID, we do not simply count brand instances per panellist (without grouping by trans_id), as this would double-count some occasions.
Produce a dataset with just one row per transaction, carrying a binary indicator of whether or not each brand was purchased.
Create a new column to track category purchase. Each row is an instance of at least one brand purchase, so the category column is set to 1.
Panellist-level transaction summary
The function below executes a query plan which summarises each panellist’s purchase occasions (transactions) for each brand.
def trans_summary(brand, lf, year):
# Primary Step: Filter by Year 1 and Remove Unused Columns
filtered_lf = lf.filter(
(pl.col('week') <= (year * 52)) &
(pl.col('week') > ((year - 1) * 52))
).drop(
pl.col('week','sku_id')
)
# Intermediate Step: Group by trans_id, panel_id, and brand
group_trans = filtered_lf.drop(
pl.col('price', 'units')
).group_by(
'trans_id', 'panel_id', 'brand'
).n_unique()
if brand == "Category":
# Panellist-level category transaction summary
summary = group_trans.group_by(
'panel_id'
).n_unique()
else:
# Panellist-level brand transaction summary
summary = group_trans.filter(
pl.col('brand') == brand
).group_by(
'panel_id'
).n_unique()
return summary.select(
pl.col('panel_id'),
pl.col('trans_id').alias('# of Purchases'),
pl.col('brand').alias('Brands Purchased')
)

def trans_pivot(lf, year):
# Primary Step: Filter by Year 1 and Remove Unused Columns
filtered_lf = lf.filter(
(pl.col('week') <= (year * 52)) &
(pl.col('week') > ((year - 1) * 52))
).drop(
pl.col('week','sku_id')
)
# Intermediate Step: Group by trans_id, panel_id, and brand
group_trans = filtered_lf.drop(
pl.col('price', 'units')
).group_by(
'trans_id', 'panel_id', 'brand'
).n_unique() # count of unique entries
summary = group_trans.collect().pivot(
on='brand',
index='panel_id',
values='panel_id',
aggregate_function="len"
).join(
other=group_trans.group_by('panel_id').n_unique().drop('brand').collect(),
on='panel_id'
).rename(
{'trans_id': 'Category'}
).drop(
pl.col('panel_id')
)
return summary

Panellist-level spend summary
The function below executes a query plan which summarises the amount each panellist spent on each brand (and in the category) during a given year.
def spend_summary(brand, lf, year):
group_spend = lf.filter(
(pl.col('week') <= (year * 52)) &
(pl.col('week') > ((year - 1) * 52))
).drop(
pl.col('week','sku_id')
).with_columns(
((pl.col('units') * pl.col('price'))).alias('spend')
)
if brand == "Category":
# Panellist-level category spend summary
summary = group_spend.drop(
pl.col('units', 'price', 'brand')
).group_by(
'panel_id'
).agg(
pl.col('spend').sum()
)
else:
# Panellist-level brand spend summary
summary = group_spend.drop(
pl.col('units', 'price')
).group_by(
'panel_id', 'brand'
).agg(
pl.col('spend').sum()
).filter(
pl.col('brand') == brand
).drop('brand')
return summary

def spend_pivot(lf, year):
group_spend = lf.filter(
(pl.col('week') <= (year * 52)) &
(pl.col('week') > ((year - 1) * 52))
).drop(
pl.col('week','sku_id')
).with_columns(
((pl.col('units') * pl.col('price'))).alias('spend')
).collect().pivot(
on='brand',
index='panel_id',
values='spend',
aggregate_function='sum'
).select(
'Alpha', 'Bravo', 'Charlie', 'Delta', 'Other'
)
return group_spend

Panellist-level volume purchasing summary
The function below executes a query plan which summarises each panellist’s volume purchasing in a given year.
def vol_summary(brand, lf, year):
with pl.StringCache():
group_vol = lf.filter(
(pl.col('week') <= (year * 52)) &
(pl.col('week') > ((year - 1) * 52))
).join(
other=sku_lf,
left_on='sku_id',
right_on='sku_id'
).drop(
pl.col('week','sku_id')
).with_columns(
# volume column: product of each SKU's weight (converted from grams to kilograms) and the units sold
(((pl.col('units') * pl.col('weight'))/1000)).alias('volume')
).drop(
pl.col('units', 'price', 'weight')
)
if brand == "Category":
# Panellist-level category volume sales summary
summary = group_vol.drop(
pl.col('brand')
).group_by(
'panel_id'
).agg(
pl.col('volume').sum()
)
else:
# Panellist-level brand volume sales summary
summary = group_vol.group_by(
'panel_id', 'brand'
).agg(
pl.col('volume').sum()
).filter(
pl.col('brand') == brand
).drop('brand')
return summary

def vol_pivot(lf, year):
with pl.StringCache():
group_vol = lf.filter(
(pl.col('week') <= (year * 52)) &
(pl.col('week') > ((year - 1) * 52))
).join(
other=sku_lf,
left_on='sku_id',
right_on='sku_id'
).drop(
pl.col('week','sku_id')
).with_columns(
# volume column: product of each SKU's weight (converted from grams to kilograms) and the units sold
(((pl.col('units') * pl.col('weight'))/1000)).alias('volume')
).drop(
pl.col('units', 'price', 'weight')
).collect().pivot(
on='brand',
index='panel_id',
values='volume',
aggregate_function='sum'
).select(
'Alpha', 'Bravo', 'Charlie', 'Delta', 'Other'
)
return group_vol

Looking at panellist-level spend and panellist-level volume purchasing, we note that for each row, Category equals the sum of the brand numbers, as we would expect. But this is not always the case in panellist-level transactions. Why is the sum of the brand-specific numbers sometimes greater than the associated category number?
Answer: Because a single transaction (once grouped) can include purchases of multiple brands, yet it counts as only one category purchase occasion. Consider panellist 3102016 and transaction ID 844: the six entries on that one transaction count as a single category transaction, a single Alpha transaction, and a single Delta transaction. The category transaction count is not the sum of the brand transaction counts because, regardless of the brand(s) and quantities purchased, the panellist purchased the category only once in that transaction.
Two standard brand performance metrics that summarize purchasing behaviour are penetration and purchases per buyer (PPB).
Looking at panellist-level transaction summary, we see that there are 4574 rows in this table, meaning that we have summary data on the purchasing of 4574 households in year 1. But there are 5021 households in the panel. What has happened to the remaining 447 households? They did not make any category purchase during the year. (But they will have purchased in other categories.)
# the number of panellists who purchased each brand at least once in year 1
buyers = trans_pivot(grocery_lf, 1).count().select(
'Alpha', 'Bravo', 'Charlie', 'Delta', 'Other', 'Category'
).unpivot(
variable_name='brand',
value_name='buyers'
)
(
GT(buyers, rowname_col='brand')
.tab_header(title='Number of Buyers in Year 1')
.tab_stubhead(label="Brands/Category")
.fmt_integer()
.data_color(
columns=['buyers'],
rows=['Alpha', 'Bravo', 'Charlie', 'Delta', 'Other'],
domain=[100, 3_000],
palette=['white', 'rebeccapurple']
)
.cols_label(buyers='# Of Buyers')
)

| Number of Buyers in Year 1 | |
|---|---|
| Brands/Category | # Of Buyers |
| Alpha | 2,624 |
| Bravo | 2,562 |
| Charlie | 813 |
| Delta | 380 |
| Other | 176 |
| Category | 4,574 |
# On how many purchase occasions did the brand/category buyers of each brand/category buy that brand/category?
# i.e, On how many purchase occasions did the 2624 buyers of Alpha buy Alpha?
transactions = trans_pivot(grocery_lf, 1).sum().select(
'Alpha', 'Bravo', 'Charlie', 'Delta', 'Other', 'Category'
).unpivot(
variable_name='brand',
value_name='transactions'
)
(
GT(transactions, rowname_col='brand')
.tab_header(title='Purchase Occasions by Buyers')
.tab_stubhead(label="Brands/Category")
.fmt_integer()
.data_color(
columns=['transactions'],
rows=['Alpha', 'Bravo', 'Charlie', 'Delta', 'Other'],
domain=[400, 10_000],
palette=[ 'white', 'rebeccapurple']
)
.cols_label(transactions='# Of Transactions')
)

| Purchase Occasions by Buyers | |
|---|---|
| Brands/Category | # Of Transactions |
| Alpha | 9,060 |
| Bravo | 8,255 |
| Charlie | 1,882 |
| Delta | 859 |
| Other | 422 |
| Category | 20,030 |
panel_size = 5021
# Penetration is simply the number of brand buyers divided by the number of panellists, expressed as a percentage
penetration = buyers.with_columns(
(pl.col('buyers') / panel_size).alias('penetration')
)
# Purchases per buyer (PPB) is simply to total number of purchases occasions divided by the number of buyers.
ppb = transactions.with_columns(
(pl.col('transactions') / penetration.get_column('buyers')).alias('ppb')
)

(
GT(penetration, rowname_col='brand')
.tab_header(title='Penetration by Category & Brand')
.tab_stubhead(label="Brands/Category")
.fmt_percent()
.data_color(
columns=['penetration'],
rows=['Alpha', 'Bravo', 'Charlie', 'Delta', 'Other'],
domain=[0, 1],
palette=['white', 'rebeccapurple']
)
.cols_label(penetration='Penetration (%)')
.cols_hide('buyers')
)

| Penetration by Category & Brand | |
|---|---|
| Brands/Category | Penetration (%) |
| Alpha | 52.26% |
| Bravo | 51.03% |
| Charlie | 16.19% |
| Delta | 7.57% |
| Other | 3.51% |
| Category | 91.10% |
(
GT(ppb, rowname_col='brand')
.tab_header(title='Purchases per Buyer (PPB)')
.tab_stubhead(label="Brands/Category")
.fmt_number()
.data_color(
columns=['ppb'],
rows=['Alpha', 'Bravo', 'Charlie', 'Delta', 'Other'],
domain=[0, 5],
palette=['white', 'rebeccapurple']
)
.cols_label(ppb='PPB')
.cols_hide('transactions')
)

| Purchases per Buyer (PPB) | |
|---|---|
| Brands/Category | PPB |
| Alpha | 3.45 |
| Bravo | 3.22 |
| Charlie | 2.31 |
| Delta | 2.26 |
| Other | 2.40 |
| Category | 4.38 |
We see that 91% of the households in the panel purchased in the category at least once in year 1. (This is a widely purchased product category.) On average, they purchased in the category 4.4 times that year. Looking at Alpha, we see that 52% of the households in the panel purchased the brand at least once, purchasing it on average 3.5 times.
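These two measures decompose total purchasing: total transactions = panel size × penetration × PPB. A quick arithmetic check using the category figures reported above:

```python
panel_size = 5021
category_buyers = 4574          # households with at least one category purchase in year 1
category_transactions = 20030   # their total purchase occasions

penetration = category_buyers / panel_size       # share of the panel that bought at all
ppb = category_transactions / category_buyers    # purchases per buyer

# Multiplying back reproduces the total number of transactions.
reconstructed = panel_size * penetration * ppb
print(round(penetration, 3), round(ppb, 2), round(reconstructed))
```

The decomposition is useful diagnostically: a change in total purchasing can be attributed to a change in how many households buy (penetration) versus how often buyers buy (PPB).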
Penetration and PPB are in fact summary measures of an important but frequently overlooked characterisation of buyer behaviour: the distribution of purchase frequency. We first explore how to create this summary for category purchasing and then consider how to create the corresponding summary for brand purchasing.
Looking at the panellist-level transaction summary, the distribution of category purchasing is determined by counting how many households made one category purchase (panellists 3102011, 3102046, etc.), two category purchases (panellists 3102012, 3102021, etc.), and so on.
# Category Transaction Distribution
category_trans_dist = (
trans_summary('Category', grocery_lf, 1)
.group_by(pl.col('# of Purchases'))
.agg(pl.len().alias("Frequency"))
.collect()
# Add zero purchase row to the top of the data series with a new column name 'Frequency'
.pipe(
lambda df: df.vstack(
pl.DataFrame(
{
"# of Purchases": [0],
"Frequency": [panel_size - df["Frequency"].sum()],
}
).with_columns(
pl.col("# of Purchases").cast(pl.UInt32),
pl.col('Frequency').cast(pl.UInt32)
)
)
)
.sort(by='# of Purchases')
# Add percentage of total column
.with_columns(
(pl.col('Frequency') / pl.col('Frequency').sum()).alias("% of Total")
)
)
(
GT(category_trans_dist, rowname_col='# of Purchases')
.tab_header(title='Category Purchase Distribution')
.tab_stubhead(label="# of Purchases")
.fmt_integer(columns='Frequency')
.fmt_percent(columns='% of Total')
.data_color(
columns=['% of Total'],
domain=[0, 0.2],
palette=['white', 'rebeccapurple']
)
)
| Category Purchase Distribution | ||
|---|---|---|
| # of Purchases | Frequency | % of Total |
| 0 | 447 | 8.90% |
| 1 | 667 | 13.28% |
| 2 | 759 | 15.12% |
| 3 | 757 | 15.08% |
| 4 | 612 | 12.19% |
| 5 | 514 | 10.24% |
| 6 | 381 | 7.59% |
| 7 | 247 | 4.92% |
| 8 | 209 | 4.16% |
| 9 | 127 | 2.53% |
| 10 | 85 | 1.69% |
| 11 | 57 | 1.14% |
| 12 | 50 | 1.00% |
| 13 | 27 | 0.54% |
| 14 | 31 | 0.62% |
| 15 | 15 | 0.30% |
| 16 | 16 | 0.32% |
| 17 | 7 | 0.14% |
| 18 | 4 | 0.08% |
| 19 | 2 | 0.04% |
| 20 | 4 | 0.08% |
| 22 | 1 | 0.02% |
| 25 | 1 | 0.02% |
| 27 | 1 | 0.02% |
If we wish to create a visual representation of this distribution, it is tempting to simply plot the data in the % of Total column. However, the resulting plot would be misleading because some purchase frequencies are missing from the data. In particular, no household made 21 category purchases; ditto for 23, 24, and 26. One solution is to manually insert the missing purchase-frequency levels with zero frequencies and then plot the data. However, the observed (relative) frequencies in the right tail are so small that they barely show up in a plot. We therefore create a right-censored distribution. Here we have chosen 15 as the (right) censoring point; we see that 51 households (or 1% of the panellists) made 15 or more category purchases in year 1.
There is nothing magical about our choice of 15. Generally, the choice of censoring point is a function of how many bins you wish to display and the height of the right-most bar with which you feel comfortable.
Note that while the right-censored distribution is useful when creating a summary figure or table, it is of limited value beyond that. For example, it is not possible to compute the mean purchase frequency (and therefore PPB) from this summary.
# Create a custom bin list (with an overflow bin) to label categories
labels = [str(i) if i < 15 else '15+' for i in range(0, 16, 1)]
chart = freq_dist_plot(
data=category_trans_dist,
column="# of Purchases",
bin_edges=np.arange(0, 15, 1),
labels=labels,
x_title="Number of Category Transactions",
y_title="% of Households",
chart_title="Category Transaction Distribution (%)",
subtitle="Distribution of category purchasing in year 1",
left_closed=False,
compute_rel_freq=False
)
chart.show()
We see that 9% of the panellists made no category purchases, 13% of the panellists made one category purchase, . . . , and 1% of the panellists made at least 15 category purchases in year 1.
What is the average number of times the category was purchased by a household in year 1?
# Mean of the relative purchase frequency
mean = category_trans_dist.select(
(pl.col('# of Purchases') * pl.col('% of Total')).sum()
).item(0,0)
print(f'{mean = :0.3f}')
mean = 3.989
We see that average number of category purchases is 4.0. Why is this different from the 4.4 purchases per buyer (PPB) computed above? The mean we have just computed includes those households that made zero purchases, whereas PPB is the average among those households that made at least one (in this case category) purchase.
We can derive the relationship between these two quantities (mean of the relative frequency and PPB) in the following manner:
\[ \begin{aligned} PPB &= \sum_{x=1}^{\max x} \frac{xf_{x}}{n-f_{0}} \\ &= \sum_{x=0}^{\max x} \frac{xf_{x}}{n-f_{0}} \\ &= \sum_{x=0}^{\max x} \left(\frac{xf_{x}}{n-f_{0}}\right)\left(\frac{n}{n}\right) \\ &= \sum_{x=0}^{\max x} \left(\frac{xf_{x}}{n}\right)\left(\frac{n}{n-f_{0}}\right) \\ &= \left(\frac{n}{n-f_{0}}\right)\sum_{x=0}^{\max x} \left(\frac{xf_{x}}{n}\right) \\ &= \frac{E(X)}{1-P(X=0)} \end{aligned} \]
In other words, PPB is the mean divided by penetration.
# All category PPB computations here are equivalent
ppb_category = mean / penetration.filter(pl.col('brand') == 'Category').item(0,2)
print(ppb_category)
print(ppb.filter(pl.col('brand') == 'Category').item(0,2))
ppb_category = mean / (1 - category_trans_dist['% of Total'][0])
print(ppb_category)
4.379099256668124
4.379099256668124
4.379099256668124
Let us now create the distribution of purchase frequency for Alpha. With the one exception noted below, the logic follows that associated with creating the distribution of category purchasing.
In contrast to the group-by aggregation output associated with our summary of category purchasing, this aggregation does contain a zero category. However, we must be careful in our interpretation of the associated frequency: it tells us that 1950 category buyers did not buy Alpha in year 1. In order to have a complete summary of brand purchasing, we should also account for those 447 households that made no category purchases that year. The number of panellists making zero purchases of Alpha is therefore the total number of panellists (5021) minus the number of panellists that made 1, 2, 3, . . . purchases of the brand.
Note: What is Alpha’s penetration of category buyers? It is 1 − 1950/4574 = 57%, i.e., one minus the proportion of category buyers who did not purchase the brand.
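To make the two penetration bases concrete, here is a minimal arithmetic sketch using the counts reported above (5021 panellists, 4574 category buyers, 1950 category buyers who never bought Alpha):

```python
panel_size = 5021        # all panellists
category_buyers = 4574   # households with >= 1 category purchase in year 1
alpha_non_buyers = 1950  # category buyers who never bought Alpha

alpha_buyers = category_buyers - alpha_non_buyers  # 2624 Alpha buyers

# Penetration of the whole panel (the measure reported in the table above)
panel_penetration = alpha_buyers / panel_size
# Penetration of category buyers (the measure in the note)
category_penetration = 1 - alpha_non_buyers / category_buyers

print(f"panel penetration:    {panel_penetration:.1%}")     # ~52%
print(f"category penetration: {category_penetration:.1%}")  # ~57%
```

The same brand can thus look quite different depending on the denominator: roughly 52% of all panellists bought Alpha, but roughly 57% of category buyers did.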
# Alpha Brand Transaction Distribution
alpha_trans_dist = (
trans_summary('Alpha', grocery_lf, 1)
.group_by(pl.col('# of Purchases'))
.agg(pl.len().alias("Frequency"))
.collect()
.pipe(
lambda df: df.vstack(
pl.DataFrame(
{
"# of Purchases": [0],
# Calculate zero purchases by subtracting the total sum of all purchases from the full panel size
"Frequency": [panel_size - df["Frequency"].sum()], # This ensures all panelists who made zero purchases are accounted for
}
).with_columns(
pl.col("# of Purchases").cast(pl.UInt32),
pl.col('Frequency').cast(pl.UInt32)
)
)
)
.sort(by='# of Purchases')
.with_columns(
# Add a '% of Total' column
(pl.col('Frequency') / pl.col('Frequency').sum()).alias("% of Total")
)
)
(
GT(alpha_trans_dist, rowname_col='# of Purchases')
.tab_header(title='Alpha Purchase Distribution')
.tab_stubhead(label="# of Purchases")
.fmt_integer(columns='Frequency')
.fmt_percent(columns='% of Total')
.data_color(
columns=['% of Total'],
domain=[0, 0.5],
palette=['white', 'rebeccapurple']
)
)
| Alpha Purchase Distribution | ||
|---|---|---|
| # of Purchases | Frequency | % of Total |
| 0 | 2,397 | 47.74% |
| 1 | 733 | 14.60% |
| 2 | 517 | 10.30% |
| 3 | 400 | 7.97% |
| 4 | 277 | 5.52% |
| 5 | 227 | 4.52% |
| 6 | 144 | 2.87% |
| 7 | 101 | 2.01% |
| 8 | 71 | 1.41% |
| 9 | 47 | 0.94% |
| 10 | 28 | 0.56% |
| 11 | 24 | 0.48% |
| 12 | 16 | 0.32% |
| 13 | 16 | 0.32% |
| 14 | 12 | 0.24% |
| 15 | 6 | 0.12% |
| 16 | 3 | 0.06% |
| 18 | 1 | 0.02% |
| 20 | 1 | 0.02% |
# Create a custom bin list (with an overflow bin) to label categories
labels = [str(i) if i < 10 else '10+' for i in range(11)]
chart = freq_dist_plot(
data=alpha_trans_dist,
column="# of Purchases",
bin_edges=np.arange(0, 10, 1),
labels=labels,
x_title="Number of Transactions",
y_title="% of Households",
chart_title="Alpha Transaction Distribution (%)",
subtitle="Distribution of purchase frequency for Alpha in year 1",
left_closed=False,
compute_rel_freq=False
)
chart.show()
We now turn our attention to creating summaries of total spend. Our initial goal is to create a histogram of category spend (in dollars) across those panellists that made at least one category purchase in year 1. In the following plot, the raw total spend data have been binned into bins with a width of $5.
We see that 16% of category buyers spent up to $5 in the category during this one-year period; 23% spent between $5 and $10; . . . and 2% spent more than $50.
Before deciding on what bin width to use when creating the histogram, let us first compute some basic descriptive statistics.
category_spend = spend_summary('Category', grocery_lf, 1).drop('panel_id')
category_spend.describe()
| statistic | spend |
|---|---|
| str | f64 |
| "count" | 4574.0 |
| "null_count" | 0.0 |
| "mean" | 15.763424 |
| "std" | 12.876024 |
| "min" | 1.27 |
| "25%" | 6.76 |
| "50%" | 12.57 |
| "75%" | 20.74 |
| "max" | 166.699936 |
We wish to visualize the variability in category spend. Given this objective, some would automatically think of creating a (kernel) density plot.
alt.Chart(category_spend.collect()).transform_density(
'spend',
as_=['spend', 'Density']
).mark_area().encode(
x=alt.X("spend:Q").scale(domain=(0,150)),
y='Density:Q',
).properties(
width=650,
height=250
)
While this provides a good visualisation of the shape of the distribution, it can be difficult for most “consumers” of the plot to extract additional information that may be of interest. For example, it is not easy to answer the question “What percentage of category buyers spent $30 or less in year 1?”.
There appears to be quite a bit of variability in category spend. To get further insight into the distribution of total category spend across the panellists, we need to determine the total spend quantities associated with the various percentile levels.
percentiles = np.arange(5, 96, 5)
# np.percentile returns the k-th percentile of values in a range
percentile_values = np.percentile(category_spend.select('spend').collect(), percentiles)
percentile_df = pl.DataFrame({
'Percentile Level': percentiles,
'Percentile': percentile_values})
(
GT(percentile_df, rowname_col='Percentile Level')
.tab_header(title='Category Spend Percentile')
.tab_stubhead(label="Percentile Level")
.fmt_number()
.data_color(
columns=['Percentile'],
domain=[0, 40],
palette=['white', 'rebeccapurple']
)
)
| Category Spend Percentile | |
|---|---|
| Percentile Level | Percentile |
| 5 | 2.69 |
| 10 | 3.39 |
| 15 | 4.67 |
| 20 | 5.88 |
| 25 | 6.76 |
| 30 | 7.63 |
| 35 | 9.07 |
| 40 | 10.06 |
| 45 | 11.03 |
| 50 | 12.57 |
| 55 | 13.73 |
| 60 | 15.24 |
| 65 | 16.84 |
| 70 | 18.79 |
| 75 | 20.74 |
| 80 | 23.43 |
| 85 | 26.88 |
| 90 | 31.81 |
| 95 | 39.72 |
We see that 5% of the category buyers spent $2.69 or less in the category during the year, 10% spent $3.39 or less, and so on. The heaviest 5% of buyers each spent more than $39.72 in the category during the year.
Looking at this output, we conclude that a bin size of $5 is probably about right.
How many bins do we go with? This is an empirical question. Since 5% of the panel spent more than $39.72, we certainly want to go above $40 in order to get a sense of how they are spread between $39.72 and the maximum of $166.70. We will go with $50. If too many panellists have spent more than $50, we can always add more bins. If too few panellists fall into this bin, we can always combine the bins we have created.
Note: The distributions of many customer behaviours have a long right tail. Accommodating the full range of values can make it difficult to get a clear sense of what is happening on the left side of the distribution. It can therefore be helpful to bin the data (as with a histogram) but to right-censor it, assigning all observations with a value of \(x\) or higher to an \(x+\) bin.
We see that 750 panellists spent $5 or less in the category during year 1, 1073 spent between $5 and $10, . . . , and 109 spent more than $50.
We convert the raw counts into percentages and plot these percentages as a bar chart.
# Create a custom bin list (with an overflow bin) to label categories
labels = [f"{lb} - {lb + 5 if lb != 50 else '∞'}" for lb in range(-5, 55, 5)]
chart = freq_dist_plot(
data=category_spend,
column="spend",
bin_edges=np.arange(0, 51, 5),
labels=labels,
x_title="Category Spend ($)",
y_title="% of Households",
chart_title="Category Spend Distribution (%)",
subtitle="Distribution of category spend across category buyers"
)
chart.show()
The general shape of this distribution (i.e., an interior mode, median less than the mean, right-skewed with a long right tail) is what we typically observe when we look at spend data.
This, of course, depends on the bin width we choose when summarising the data. If we had chosen a bin width of $10, the left-most bar would be the highest bar and we would no longer observe an interior mode.
Let us now create the distribution of spend on Alpha. We will follow the same basic process as for the distribution of category spend with a few minor changes.
Note that there are a number of 0 or NaN values in the Alpha spend column. While these panellists purchased in the category during the year, they did not purchase any of Alpha’s SKUs. The first thing we need to do is remove these observations.
alpha_spend = spend_summary('Alpha', grocery_lf, 1).drop('panel_id')
alpha_spend.describe()
| statistic | spend |
|---|---|
| str | f64 |
| "count" | 2624.0 |
| "null_count" | 0.0 |
| "mean" | 12.793803 |
| "std" | 11.715616 |
| "min" | 1.57 |
| "25%" | 3.69 |
| "50%" | 9.67 |
| "75%" | 17.130001 |
| "max" | 105.899971 |
percentiles = np.arange(5, 96, 5)
percentile_values = np.percentile(alpha_spend.select('spend').collect(), percentiles)
percentile_df = pl.DataFrame({
'Percentile Level': percentiles,
'Percentile': percentile_values})
(
GT(percentile_df, rowname_col='Percentile Level')
.tab_header(title='Alpha Spend Percentile')
.tab_stubhead(label="Percentile Level")
.fmt_number()
.data_color(
columns=['Percentile'],
domain=[0, 40],
palette=['white', 'rebeccapurple']
)
)
| Alpha Spend Percentile | |
|---|---|
| Percentile Level | Percentile |
| 5 | 2.45 |
| 10 | 2.79 |
| 15 | 3.38 |
| 20 | 3.49 |
| 25 | 3.69 |
| 30 | 5.48 |
| 35 | 6.58 |
| 40 | 6.88 |
| 45 | 7.67 |
| 50 | 9.67 |
| 55 | 10.27 |
| 60 | 11.26 |
| 65 | 13.46 |
| 70 | 14.86 |
| 75 | 17.13 |
| 80 | 20.04 |
| 85 | 23.14 |
| 90 | 27.62 |
| 95 | 35.49 |
Having looked at some basic descriptive stats (as above), we will summarise the data using bins of width $2, right-censoring at $40.
labels = [f"{lb} - {lb + 2 if lb != 40 else '∞'}" for lb in range(-2, 41, 2)]
chart = freq_dist_plot(
data=alpha_spend,
column="spend",
bin_edges=np.arange(0, 41, 2),
labels=labels,
x_title="Brand Spend ($)",
y_title="% of Households",
chart_title="Alpha Spend Distribution (%)",
subtitle="Distribution of spend on Alpha across brand buyers",
label_angle=-45
)
chart.show()
The general observation made earlier about the shape of the category spend distribution also holds here. The distribution is not as smooth, but that is a function of the smaller bin sizes. There is, however, one obvious aberration: the large spike for $2–$4. What is going on here? The average price of most Alpha SKUs is between $2 and $4. (Review average SKU prices of Alpha.) We recall from Alpha’s distribution of purchase frequency that 48% of the panel made zero purchases of Alpha and 15% made one transaction. This implies that 0.15/(1 − 0.48) = 29% of Alpha buyers made just one purchase of the brand. Assuming they purchased only one unit of one Alpha SKU on that purchase occasion, we would expect a large number of Alpha buyers to spend between $2 and $4; 24% is not unrealistic.
How would you determine the validity of the assumption that the buyers who purchased Alpha just once mostly bought only one unit of Alpha SKU on that purchase occasion?
# One-Time Alpha Buyers
one_time_alpha_buyers = trans_summary('Alpha',grocery_lf,1).filter(pl.col('# of Purchases') == 1).collect()
one_time_alpha_buyers
# Alpha buyers who spent between $2 to $4
spend_range_alpha_buyers = spend_summary('Alpha', grocery_lf, 1).filter(
(pl.col('spend') >= 2) &
(pl.col('spend') < 4)
).collect()
res = one_time_alpha_buyers.with_columns(
contains=pl.col('panel_id').is_in(spend_range_alpha_buyers.select('panel_id'))
).select('contains').sum().item()
print(f"There were {one_time_alpha_buyers.count().item(0, 'panel_id')} one-time buyers of Alpha brand in Year 1.")
print(f'{res} of those one-time Alpha buyers spent between [$2, $4).')
print(f"Given that there are {spend_range_alpha_buyers.count().item(0, 'spend')} Alpha buyers in total who spent between $2-$4,")
print(f"the {res} one-time buyers represent {res / spend_range_alpha_buyers.count().item(0, 'spend'):.0%} of the Alpha buyers in that spend range.")
There were 733 one-time buyers of Alpha brand in Year 1.
623 of those one-time Alpha buyers spent between [$2, $4).
Given that there are 633 Alpha buyers in total who spent between $2-$4,
the 623 one-time buyers represent 98% of the Alpha buyers in that spend range.
Validate:
spend_range_alpha_buyers.filter(
pl.col('panel_id').is_in(one_time_alpha_buyers.select('panel_id'))
).select(pl.col('spend').sum()).item()
1937.2301025390625
grocery_lf.filter(
(pl.col('week') <= 52) &
(pl.col('panel_id').is_in(spend_range_alpha_buyers.select('panel_id'))) & # Assuming panel_id is the relevant column
(pl.col('panel_id').is_in(one_time_alpha_buyers.select('panel_id'))) & # Same assumption
(pl.col('brand') == 'Alpha')
).with_columns(
((pl.col('units') * pl.col('price'))).alias('spend')
).select(pl.col('spend').sum()).collect().item()
1937.229736328125
category_vol = vol_summary('Category', grocery_lf, 1).drop('panel_id')
category_vol.describe()
| statistic | volume |
|---|---|
| str | f64 |
| "count" | 4574.0 |
| "null_count" | 0.0 |
| "mean" | 4.544163 |
| "std" | 3.884697 |
| "min" | 0.25 |
| "25%" | 2.0 |
| "50%" | 3.5 |
| "75%" | 6.0 |
| "max" | 67.0 |
percentiles = np.arange(5, 96, 5)
with pl.StringCache():
percentile_values = np.percentile(category_vol.select('volume').collect(), percentiles)
percentile_df = pl.DataFrame({
'Percentile Level': percentiles,
'Percentile': percentile_values})
(
GT(percentile_df, rowname_col='Percentile Level')
.tab_header(title='Category Volume Percentile')
.tab_stubhead(label="Percentile Level")
.fmt_number()
.data_color(
columns=['Percentile'],
domain=[0, 15],
palette=['white', 'rebeccapurple']
)
)
| Category Volume Percentile | |
|---|---|
| Percentile Level | Percentile |
| 5 | 0.83 |
| 10 | 1.00 |
| 15 | 1.00 |
| 20 | 1.50 |
| 25 | 2.00 |
| 30 | 2.00 |
| 35 | 2.50 |
| 40 | 3.00 |
| 45 | 3.00 |
| 50 | 3.50 |
| 55 | 4.00 |
| 60 | 4.38 |
| 65 | 5.00 |
| 70 | 5.50 |
| 75 | 6.00 |
| 80 | 7.00 |
| 85 | 8.00 |
| 90 | 9.00 |
| 95 | 12.00 |
labels = [f"{lb} - {lb + 1 if lb != 12 else '∞'}" for lb in range(-1, 13, 1)]
chart = freq_dist_plot(
data=category_vol,
column="volume",
bin_edges=np.arange(0, 13, 1),
labels=labels,
x_title="Category Volume Purchased",
y_title="% of Households",
chart_title="Category Volume Purchase Distribution (%)",
subtitle="Distribution of category volume purchases across category buyers"
)
chart.show()
Note that there are a number of 0 or NaN values in the Alpha volume column. These panellists (as in the brand spend section) purchased in the category during the year, but did not purchase any of Alpha’s products. Let us remove the observations for panel_ids that did not make at least one purchase of the Alpha brand during the year.
alpha_vol = vol_summary('Alpha', grocery_lf, 1).drop('panel_id')
alpha_vol.describe()
| statistic | volume |
|---|---|
| str | f64 |
| "count" | 2624.0 |
| "null_count" | 0.0 |
| "mean" | 3.493236 |
| "std" | 3.350901 |
| "min" | 0.25 |
| "25%" | 1.0 |
| "50%" | 2.5 |
| "75%" | 5.0 |
| "max" | 27.55 |
percentiles = np.arange(5, 96, 5)
with pl.StringCache():
percentile_values = np.percentile(alpha_vol.select('volume').collect(), percentiles)
percentile_df = pl.DataFrame({
'Percentile Level': percentiles,
'Percentile': percentile_values})
(
GT(percentile_df, rowname_col='Percentile Level')
.tab_header(title='Alpha Volume Percentile')
.tab_stubhead(label="Percentile Level")
.fmt_number()
.data_color(
columns=['Percentile'],
domain=[0, 15],
palette=['white', 'rebeccapurple']
)
)
| Alpha Volume Percentile | |
|---|---|
| Percentile Level | Percentile |
| 5 | 0.50 |
| 10 | 0.50 |
| 15 | 1.00 |
| 20 | 1.00 |
| 25 | 1.00 |
| 30 | 1.10 |
| 35 | 1.55 |
| 40 | 2.00 |
| 45 | 2.00 |
| 50 | 2.50 |
| 55 | 3.00 |
| 60 | 3.00 |
| 65 | 3.50 |
| 70 | 4.00 |
| 75 | 5.00 |
| 80 | 5.50 |
| 85 | 6.35 |
| 90 | 8.00 |
| 95 | 10.00 |
labels = [f"{lb} - {lb + 1 if lb != 9 else '∞'}" for lb in range(-1, 10, 1)]
chart = freq_dist_plot(
data=alpha_vol,
column="volume",
bin_edges=np.arange(0, 10, 1),
labels=labels,
x_title="Brand Volume Purchased",
y_title="% of Households",
chart_title="Alpha Brand Volume Purchase Distribution (%)",
subtitle="Distribution of Alpha volume purchases across brand buyers"
)
chart.show()
Garbage Collect
exceptions = [
'grocery_lf', 'sku_lf', 'kiwi_lf',
'In', 'exceptions', 'active_variables',
'penetration', 'ppb', 'panel_size']
active_variables = [
var for var, value in globals().items()
if not var.startswith('_') # Exclude variables that start with "_"
and var not in exceptions # Exclude variables in the exceptions list
and isinstance(value, (pl.LazyFrame, pl.DataFrame, pl.Series, alt.Chart, alt.LayerChart, list, int, float, str, np.ndarray, np.int64, np.float32)) # Remove these types only
]
for var in active_variables:
del globals()[var]
del active_variables, exceptions, var
gc.collect()
127
We have observed that there is a lot of variability in purchase frequency and spend, be it at the product or category level. A useful way of expressing the variability is via a decile analysis. As the name suggests, this sees us dividing the customer base into 10 equally sized groups and then summarising each group’s buying behaviour. We will consider two versions of this analysis.
We will first focus on creating the table below, which captures the decile analysis of category buying behaviour (where each decile equals 10% of category buyers). Recall that 4574 households made at least one category purchase in year 1. Having sorted these households by total category spend, we create 10 equally sized groups. Decile 1 is the 10% of households that spent the most in the category during year 1, decile 2 is the next largest 10% of spenders, and so on.
# Create decile cuts in the group_spend data using the sorted and ranked data
# Part 1: Break-tie ranking of total category spend data
# Part 2: Decile cuts in ranking
buyer_decile = spend_summary(
'Category', grocery_lf, 1
).with_columns(
pl.col('spend').rank(method='ordinal', descending=True).alias('rank'),
).with_columns(
(np.floor(10 * (pl.col('rank') - 1) / pl.col('rank').max()) + 1).cast(pl.UInt16).alias('decile')
).sort('decile', 'rank').collect()
buyer_decile
| panel_id | spend | rank | decile |
|---|---|---|---|
| u32 | f32 | u32 | u16 |
| 3111594 | 166.699936 | 1 | 1 |
| 3119170 | 108.829994 | 2 | 1 |
| 3125418 | 102.87001 | 3 | 1 |
| 3110132 | 100.630013 | 4 | 1 |
| 3109647 | 97.37999 | 5 | 1 |
| … | … | … | … |
| 3128086 | 1.49 | 4570 | 10 |
| 3112878 | 1.39 | 4571 | 10 |
| 3120986 | 1.33 | 4572 | 10 |
| 3104765 | 1.29 | 4573 | 10 |
| 3125737 | 1.27 | 4574 | 10 |
# Helper functions
def calculate_percentage(series):
return series / series.sum()
def group_aggregate(data, group_col, agg_col, agg_expr):
return data.group_by(group_col).agg(agg_expr(pl.col(agg_col)))
# Decile table creation
decile_tab = buyer_decile.select("decile").unique().sort("decile").rename({"decile": "Decile"})
# Populate decile statistics
# Decile counts and % HH
decile_counts = group_aggregate(buyer_decile, 'decile', 'panel_id', pl.Expr.count)
decile_tab.insert_column(1, pl.lit(calculate_percentage(decile_counts['panel_id'])).alias('% HH'))
# % Spend
decile_spend = group_aggregate(buyer_decile, 'decile', 'spend', pl.Expr.sum)
decile_tab.insert_column(2, pl.lit(calculate_percentage(decile_spend['spend'])).alias('% Spend'))
# % Transactions
group_trans = trans_summary('Category', grocery_lf, 1).collect().join(
other=buyer_decile.group_by('panel_id').agg(pl.col('decile').first()),  # one row per panellist, so 'first' returns its decile
on='panel_id'
).sort('decile')
decile_trans = group_aggregate(group_trans, 'decile', '# of Purchases', pl.Expr.sum)
decile_tab.insert_column(3, pl.lit(calculate_percentage(decile_trans['# of Purchases'])).alias('% Trans'))
# Spend/HH, Cat Trans/HH, Avg Spend/Trans (AOV)
decile_tab.insert_column(4, pl.lit(decile_spend['spend'] / decile_counts['panel_id']).alias('Spend/HH'))
decile_tab.insert_column(5, pl.lit(decile_trans['# of Purchases'] / decile_counts['panel_id']).alias('Cat Trans/HH'))
decile_tab.insert_column(6, pl.lit(decile_spend['spend'] / decile_trans['# of Purchases']).alias('Avg Spend/Trans'))
# Unique Brands Purchased
unique_brands = group_aggregate(group_trans, 'decile', 'Brands Purchased', pl.Expr.mean)
decile_tab.insert_column(7, pl.lit(unique_brands['Brands Purchased']).alias('# Unique Brands'))
(
GT(decile_tab, rowname_col='Decile')
.tab_header(title='Decile Analysis of Category Buying Behavior',
subtitle= 'Each decile equals 10% of category buyers')
.tab_stubhead(label="Decile")
.fmt_percent(columns=['% HH', '% Spend', '% Trans'], decimals=1)
.fmt_number(columns=['Cat Trans/HH', '# Unique Brands'])
.fmt_currency(columns=['Spend/HH','Avg Spend/Trans'])
)
| Decile Analysis of Category Buying Behavior | |||||||
|---|---|---|---|---|---|---|---|
| Each decile equals 10% of category buyers | |||||||
| Decile | % HH | % Spend | % Trans | Spend/HH | Cat Trans/HH | Avg Spend/Trans | # Unique Brands |
| 1 | 10.0% | 28.4% | 24.3% | $44.65 | 10.62 | $4.21 | 1.85 |
| 2 | 10.0% | 17.2% | 16.3% | $27.15 | 7.14 | $3.80 | 1.64 |
| 3 | 10.0% | 13.3% | 13.5% | $20.96 | 5.88 | $3.56 | 1.58 |
| 4 | 10.0% | 10.7% | 11.1% | $16.91 | 4.87 | $3.47 | 1.54 |
| 5 | 10.0% | 8.8% | 9.5% | $13.81 | 4.15 | $3.33 | 1.50 |
| 6 | 10.0% | 7.1% | 7.8% | $11.14 | 3.41 | $3.27 | 1.41 |
| 7 | 10.0% | 5.7% | 6.7% | $8.98 | 2.94 | $3.05 | 1.37 |
| 8 | 10.0% | 4.3% | 4.9% | $6.74 | 2.13 | $3.16 | 1.22 |
| 9 | 10.0% | 2.9% | 3.7% | $4.56 | 1.62 | $2.82 | 1.21 |
| 10 | 10.0% | 1.7% | 2.3% | $2.70 | 1.02 | $2.65 | 1.00 |
Reading across the row associated with Decile 1, we see that the top 10% of category buyers accounted for 28% of category spend and 24% of total transactions. On average, they spent $44.65 in the category across an average of 10.6 purchase occasions, which corresponds to an average category spend per transaction of $4.21. On average, these households purchased 1.8 different brands during the year. Contrast this with the 10% of category buyers that spent the least: they account for roughly 2% of category spend and transactions, making on average one category purchase. And so on.
This is a very basic decile table. Some additional information that could be reported includes the average number of units purchased per transaction and the average number of unique SKUs purchased in the year.
The decile analysis we have just completed uses deciles that each represent 10% of the category buyers. An alternative approach is to create deciles that each represent 10% of category spend. In other words, the previous analysis took the entire customer base that purchased the category in year 1 and split that base into ten equally sized groups; this analysis takes total customer spend (total revenue) and splits that revenue into ten equal parts. The only change to what we have done above is how we create the decile variable.
# Divide the total revenue into 10 equal parts and generate the decile bins
spend_total = spend_summary('Category', grocery_lf, 1).select(pl.col('spend')).sum().collect().to_series()[0]
breaks = np.arange(0, spend_total, spend_total/10)
labels = [f'{i}' for i in range(0, 11, 1)]
spend_decile = spend_summary(
'Category', grocery_lf, 1
).sort(
'spend', descending=True
).with_columns(
pl.col('spend').cum_sum().alias('cum sum') # Create a cumulative sum column for category spend (category revenue)
).with_columns(
pl.col('cum sum').cut( # Cut function determines the decile range for each cumulative sum value
breaks=breaks,
labels=labels # Function also adds the decile label from 1 to 10
).cast(pl.UInt16).alias('decile')
).sort('decile').collect()
spend_decile
| panel_id | spend | cum sum | decile |
|---|---|---|---|
| u32 | f32 | f32 | u16 |
| 3111594 | 166.699936 | 166.699936 | 1 |
| 3119170 | 108.829994 | 275.529938 | 1 |
| 3125418 | 102.87001 | 378.399963 | 1 |
| 3110132 | 100.630013 | 479.029968 | 1 |
| 3109647 | 97.37999 | 576.409973 | 1 |
| … | … | … | … |
| 3128086 | 1.49 | 72096.984375 | 10 |
| 3112878 | 1.39 | 72098.375 | 10 |
| 3120986 | 1.33 | 72099.703125 | 10 |
| 3104765 | 1.29 | 72100.992188 | 10 |
| 3125737 | 1.27 | 72102.265625 | 10 |
# Helper functions
def calculate_percentage(series):
return series / series.sum()
def group_aggregate(data, group_col, agg_col, agg_expr):
return data.group_by(group_col).agg(agg_expr(pl.col(agg_col)))
# Decile table creation
decile_tab_spend = spend_decile.select("decile").unique().sort("decile").rename({"decile": "Decile"})
# Populate decile statistics
# Decile counts and % HH
decile_counts = group_aggregate(spend_decile, 'decile', 'panel_id', pl.Expr.count)
decile_tab_spend.insert_column(1, pl.lit(calculate_percentage(decile_counts['panel_id'])).alias('% HH'))
# % Spend
decile_spend = group_aggregate(spend_decile, 'decile', 'spend', pl.Expr.sum)
decile_tab_spend.insert_column(2, pl.lit(calculate_percentage(decile_spend['spend'])).alias('% Spend'))
# % Transactions
group_trans = trans_summary('Category', grocery_lf, 1).collect().join(
other=spend_decile.group_by('panel_id').agg(pl.col('decile').first()),  # one row per panellist, so 'first' returns its decile
on='panel_id'
).sort('decile')
decile_trans = group_aggregate(group_trans, 'decile', '# of Purchases', pl.Expr.sum)
decile_tab_spend.insert_column(3, pl.lit(calculate_percentage(decile_trans['# of Purchases'])).alias('% Trans'))
# Spend/HH, Cat Trans/HH, Avg Spend/Trans (AOV)
decile_tab_spend.insert_column(4, pl.lit(decile_spend['spend'] / decile_counts['panel_id']).alias('Spend/HH'))
decile_tab_spend.insert_column(5, pl.lit(decile_trans['# of Purchases'] / decile_counts['panel_id']).alias('Cat Trans/HH'))
decile_tab_spend.insert_column(6, pl.lit(decile_spend['spend'] / decile_trans['# of Purchases']).alias('Avg Spend/Trans'))
# Unique Brands Purchased
unique_brands = group_aggregate(group_trans, 'decile', 'Brands Purchased', pl.Expr.mean)
decile_tab_spend.insert_column(7, pl.lit(unique_brands['Brands Purchased']).alias('# Unique Brands'))
(
GT(decile_tab_spend, rowname_col='Decile')
.tab_header(title='Decile Analysis of Category Buying Behavior',
subtitle= 'Each decile equals 10% of category spend/revenue')
.tab_stubhead(label="Decile")
.fmt_percent(columns=['% HH', '% Spend', '% Trans'], decimals=1)
.fmt_number(columns=['Cat Trans/HH', '# Unique Brands'])
.fmt_currency(columns=['Spend/HH','Avg Spend/Trans'])
)
| Decile Analysis of Category Buying Behavior | |||||||
|---|---|---|---|---|---|---|---|
| Each decile equals 10% of category spend/revenue | |||||||
| Decile | % HH | % Spend | % Trans | Spend/HH | Cat Trans/HH | Avg Spend/Trans | # Unique Brands |
| 1 | 2.5% | 10.0% | 7.6% | $63.77 | 13.44 | $4.74 | 1.92 |
| 2 | 3.7% | 10.0% | 8.9% | $42.33 | 10.50 | $4.03 | 1.82 |
| 3 | 4.6% | 10.0% | 9.2% | $33.97 | 8.73 | $3.89 | 1.78 |
| 4 | 5.6% | 10.0% | 9.4% | $28.21 | 7.36 | $3.83 | 1.63 |
| 5 | 6.6% | 10.0% | 9.9% | $23.69 | 6.50 | $3.65 | 1.65 |
| 6 | 7.9% | 10.0% | 10.2% | $20.01 | 5.66 | $3.53 | 1.57 |
| 7 | 9.5% | 10.0% | 10.4% | $16.66 | 4.80 | $3.47 | 1.54 |
| 8 | 11.7% | 10.0% | 10.9% | $13.49 | 4.08 | $3.31 | 1.48 |
| 9 | 15.7% | 10.0% | 11.2% | $10.06 | 3.14 | $3.21 | 1.39 |
| 10 | 32.2% | 10.0% | 12.3% | $4.89 | 1.67 | $2.93 | 1.16 |
Reading across the row associated with Decile 1, we see that the top 2.5% of households accounted for 10% of category spend. On average, they spent $63.77 in the category, across an average of 13.4 purchase occasions. This corresponds to an average category spend per category transaction of $4.74. On average, these households purchased 1.9 different brands during the year.
A Lorenz curve is a common graphical tool for visualising “concentration” or “inequality” in the distribution of a quantity of interest (e.g., income, buying behaviour). It shows the proportion of the overall quantity (e.g., number of transactions, spend) associated with the bottom x% of the unit of observation associated with the distribution (e.g., households). When analysing buyer behaviour, this sees us lining up all customers in ascending order of their level of purchasing and computing the share of total purchasing accounted for by each person. The Lorenz curve is created by plotting the cumulative percentage of customers (x-axis) against the cumulative percentage of total purchasing (y-axis).
The Lorenz curve for transactions associated with Alpha is computed and plotted below and interpreted in the following manner. We see that x = 80% roughly corresponds to y = 54%, which means that 80% of the buyers of Alpha (when sorted from least to most frequent buyers) account for 54% of all the buying of Alpha in year 1. This implies the top 20% of buyers (in terms of purchase frequency) account for 46% of total purchases. The “rule” of 80/20 does not hold here; rather, it is 46/20.
# We are only interested in those panellists that made at least one purchase of Alpha in year 1.
# Alpha Brand Transaction Distribution of Transactions Greater Than 0
alpha_trans_dist = (
trans_summary('Alpha', grocery_lf, 1)
.filter(pl.col('# of Purchases') > 0)
.group_by('# of Purchases')
# Frequency of the Grouped Transaction Count
.agg(pl.col('# of Purchases').count().alias('Frequency'))
.sort('# of Purchases')
# Add a '% of Buyers' column
.with_columns((pl.col('Frequency')/pl.col('Frequency').sum()).alias('% of Buyers'))
# Cumulative % of Buyers
.with_columns((pl.col('% of Buyers').cum_sum()).alias('Cum % of Buyers'))
# Total number of transactions made by those customers who made one purchase of Alpha
# Total Transactions = frequency x # of transactions
.with_columns((pl.col('Frequency')*pl.col('# of Purchases')).alias('Total Transactions'))
# Add a '% of Transactions' column
.with_columns((pl.col('Total Transactions')/pl.col('Total Transactions').sum()).alias('% Transactions'))
# Cumulative % of Transactions
.with_columns((pl.col('% Transactions').cum_sum()).alias('Cum % Transactions'))
)
(
GT(alpha_trans_dist.collect(), rowname_col='# of Purchases')
.tab_header(title='Alpha Buyers & Transaction Distribution',
subtitle= 'Lorenz Curve Table')
.tab_stubhead(label="# of Purchases")
.fmt_percent(columns=['% of Buyers', 'Cum % of Buyers', '% Transactions', 'Cum % Transactions'], decimals=1)
.fmt_integer(columns=['Frequency', 'Total Transactions'])
)
| Alpha Buyers & Transaction Distribution | ||||||
|---|---|---|---|---|---|---|
| Lorenz Curve Table | ||||||
| # of Purchases | Frequency | % of Buyers | Cum % of Buyers | Total Transactions | % Transactions | Cum % Transactions |
| 1 | 733 | 27.9% | 27.9% | 733 | 8.1% | 8.1% |
| 2 | 517 | 19.7% | 47.6% | 1,034 | 11.4% | 19.5% |
| 3 | 400 | 15.2% | 62.9% | 1,200 | 13.2% | 32.7% |
| 4 | 277 | 10.6% | 73.4% | 1,108 | 12.2% | 45.0% |
| 5 | 227 | 8.7% | 82.1% | 1,135 | 12.5% | 57.5% |
| 6 | 144 | 5.5% | 87.6% | 864 | 9.5% | 67.0% |
| 7 | 101 | 3.8% | 91.4% | 707 | 7.8% | 74.8% |
| 8 | 71 | 2.7% | 94.1% | 568 | 6.3% | 81.1% |
| 9 | 47 | 1.8% | 95.9% | 423 | 4.7% | 85.8% |
| 10 | 28 | 1.1% | 97.0% | 280 | 3.1% | 88.9% |
| 11 | 24 | 0.9% | 97.9% | 264 | 2.9% | 91.8% |
| 12 | 16 | 0.6% | 98.5% | 192 | 2.1% | 93.9% |
| 13 | 16 | 0.6% | 99.1% | 208 | 2.3% | 96.2% |
| 14 | 12 | 0.5% | 99.6% | 168 | 1.9% | 98.1% |
| 15 | 6 | 0.2% | 99.8% | 90 | 1.0% | 99.1% |
| 16 | 3 | 0.1% | 99.9% | 48 | 0.5% | 99.6% |
| 18 | 1 | 0.0% | 100.0% | 18 | 0.2% | 99.8% |
| 20 | 1 | 0.0% | 100.0% | 20 | 0.2% | 100.0% |
Looking at the % of buyers and % of transactions columns, we see, for example, that 48% of all buyers made two or fewer purchases and accounted for 20% of all the purchases of Alpha.
zero_row = pl.DataFrame({
'Cum % of Buyers': [0.0],
'Cum % Transactions': [0.0]
})
alpha_lorenz_plot = alpha_trans_dist.select('Cum % of Buyers' ,'Cum % Transactions').collect()
alpha_lorenz_plot_zero = pl.concat([zero_row, alpha_lorenz_plot])
chart = alt.Chart(alpha_lorenz_plot_zero).mark_line().encode(
x=alt.X('Cum % of Buyers:Q', title='Cumulative % Alpha Buyers', axis=alt.Axis(format='.0%')),
y=alt.Y('Cum % Transactions:Q',
title='Cumulative % Alpha Transactions',
axis=alt.Axis(format='.0%'),
scale=alt.Scale(domain=[0, 1])),
).properties(
title="Lorenz Curve for Alpha Transactions",
width=550,
height=450
)
chart.show()
As noted above, we see that x = 80% roughly corresponds to y = 54%. We can compute the exact number by linear interpolation in the following manner:
\[ \frac{y-y_{0}}{x-x_{0}}=\frac{y_{1}-y_{0}}{x_{1}-x_{0}} \]
\[ (y-y_{0})=\frac{(y_{1}-y_{0})}{(x_{1}-x_{0})}(x-x_{0}) \]
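As a quick check, plugging the bracketing points read off the transaction-distribution table above (the rows for 4 and 5 purchases) into this formula reproduces the interpolated value; a minimal sketch, noting that the table entries are rounded, so the result is only approximately equal to the exact `np.interp` answer:

```python
# Bracketing points around x = 0.8, read from the transaction-distribution table
x0, y0 = 0.734, 0.450  # cum % of buyers / cum % of transactions at 4 purchases
x1, y1 = 0.821, 0.575  # cum % of buyers / cum % of transactions at 5 purchases
x = 0.8

# y - y0 = (y1 - y0)/(x1 - x0) * (x - x0)
y = y0 + (y1 - y0) / (x1 - x0) * (x - x0)
print(round(y, 3))  # 0.545
```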
# Numpy linear interpolation function simplifies the computation, return the interpolated y value
np.interp(0.8, xp=alpha_lorenz_plot['Cum % of Buyers'], fp=alpha_lorenz_plot['Cum % Transactions'])
np.float64(0.5448123620309051)
A related quantity of interest is the percentage of buyers that account for half of total purchasing. This can be read off the Lorenz curve in the following manner.
# Reversing the axis returns the interpolated x value
np.interp(0.5, xp=alpha_lorenz_plot['Cum % Transactions'], fp=alpha_lorenz_plot['Cum % of Buyers'])
np.float64(0.7690548780487805)
Purchase frequency is a discrete quantity and we created the Lorenz curve off the distribution of transactions. We now consider how to create a Lorenz curve when the quantity of interest is continuous. We will focus on creating the Lorenz curve for the spend associated with Alpha.
alpha_spend_dist = (
spend_summary('Alpha', grocery_lf, 1)
.filter(pl.col('spend') > 0)
# Sort Alpha customer from lowest to highest
.sort(pl.col('spend'), descending=False)
# Compute spend as a % of total Alpha brand spend in year 1
# percentage of total Alpha purchasing accounted for by each panellist
.with_columns((pl.col('spend')/pl.col('spend').sum()).alias('% Spend'))
# Compute the cumulative percentage of spend numbers
.with_columns((pl.col('% Spend').cum_sum()).cast(pl.Float64).alias('Cum % Spend'))
).collect()
# Compute the cumulative percentage of buyers number
spend_count = alpha_spend_dist.height
alpha_spend_dist = alpha_spend_dist.with_columns(
pl.lit(np.linspace(1 / spend_count, 1, spend_count)).alias('Cum % Cust')
)
alpha_spend_dist
| panel_id | spend | % Spend | Cum % Spend | Cum % Cust |
|---|---|---|---|---|
| u32 | f32 | f32 | f64 | f64 |
| 3124392 | 1.57 | 0.000047 | 0.000047 | 0.000381 |
| 3126803 | 1.57 | 0.000047 | 0.000094 | 0.000762 |
| 3123698 | 1.57 | 0.000047 | 0.00014 | 0.001143 |
| 3126471 | 1.57 | 0.000047 | 0.000187 | 0.001524 |
| 3117751 | 1.57 | 0.000047 | 0.000234 | 0.001905 |
| … | … | … | … | … |
| 3105099 | 80.5 | 0.002398 | 0.988765 | 0.998476 |
| 3111813 | 84.020004 | 0.002503 | 0.991268 | 0.998857 |
| 3107289 | 92.029999 | 0.002741 | 0.994009 | 0.999238 |
| 3110400 | 95.199997 | 0.002836 | 0.996845 | 0.999619 |
| 3111594 | 105.899971 | 0.003155 | 1.0 | 1.0 |
zero_row = pl.DataFrame({
'Cum % Spend': [0.0],
'Cum % Cust': [0.0]
})
alpha_lorenz_plot = alpha_spend_dist.select('Cum % Spend' ,'Cum % Cust')
alpha_lorenz_plot_zero = pl.concat([zero_row, alpha_lorenz_plot])
chart = alt.Chart(alpha_lorenz_plot_zero).mark_line().encode(
x=alt.X('Cum % Cust:Q', title='Cumulative % Alpha Buyers', axis=alt.Axis(format='.0%')),
y=alt.Y('Cum % Spend:Q',
title='Cumulative % Alpha Revenue',
axis=alt.Axis(format='.0%'),
scale=alt.Scale(domain=[0, 1])),
).properties(
title="Lorenz Curve for Alpha Spend",
width=550,
height=450
)
chart.show()
Looking at alpha_spend_dist, we see that the “bottom” 80% of Alpha buyers account for 50.8% of its revenue in year 1, which is equivalent to 49/20. This is higher than the 46/20 we observed for transactions.
np.interp(0.8, xp=alpha_lorenz_plot['Cum % Cust'], fp=alpha_lorenz_plot['Cum % Spend'])np.float64(0.5083526968955995)
Garbage Collect
exceptions = [
'grocery_lf', 'sku_lf', 'kiwi_lf',
'In', 'exceptions', 'active_variables',
'penetration', 'ppb', 'panel_size']
active_variables = [
var for var, value in globals().items()
if not var.startswith('_') # Exclude variables that start with "_"
and var not in exceptions # Exclude variables in the exceptions list
and isinstance(value, (pl.LazyFrame, pl.DataFrame, pl.Series, alt.Chart, alt.LayerChart, list, int, float, str, np.ndarray, np.int64, np.float32)) # Remove these types only
]
for var in active_variables:
del globals()[var]
del active_variables, exceptions, var
gc.collect()
74
So far, we have explored purchasing at the level of the individual brand or overall category. We now step back and consider panellists’ purchasing of multiple brands (in a given time period).
To set the scene, let us first determine the percentage of category buyers that bought 1, 2, 3, 4, or 5 different brands in the course of year 1. (Note that we only have five brands in the dataset, with Other being an aggregation of several very small share brands).
brand_purchases = (
trans_summary('Category', grocery_lf, 1)
.group_by('Brands Purchased')
.agg(pl.col('panel_id').count().alias('Category Buyer'))
.with_columns((pl.col('Category Buyer') / pl.col('Category Buyer').sum()).alias('% Category Buyers'))
.sort('Brands Purchased')
).collect()
(
GT(brand_purchases, rowname_col='Brands Purchased')
.tab_header(title='Category Buyers That Bought Different Brands',
subtitle='Distribution of the number of separate brands purchased by category buyers in year 1')
.tab_stubhead(label="Brands Purchased")
.fmt_integer(columns=['Category Buyer'])
.fmt_percent(columns=['% Category Buyers'], decimals=0)
.data_color(
columns=['% Category Buyers'],
domain=[0, 0.7],
palette=['white', 'rebeccapurple']
)
)
| Category Buyers That Bought Different Brands | ||
|---|---|---|
| Distribution of the number of separate brands purchased by category buyers in year 1 | ||
| Brands Purchased | Category Buyer | % Category Buyers |
| 1 | 2,964 | 65% |
| 2 | 1,278 | 28% |
| 3 | 293 | 6% |
| 4 | 39 | 1% |
We note that almost two-thirds of the category buyers only ever bought one brand. This is despite the fact that 86% of category buyers made two or more category purchases in the course of the year. No household purchased all five brands that year.
Note: We know from analyzing the distribution of category purchase frequency that 8.9% of the panel made zero category purchases and 13.2% made one category purchase. 1 − 0.132/(1 − 0.089) = 0.855, which means 86% of year 1 category buyers made more than one category purchase that year.
This can also be computed as (1 − 0.132 − 0.089)/(1 − 0.089), i.e., [% of panellists that made more than one category purchase] / [% of panellists that are category buyers].
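A quick numeric check, using the distribution figures quoted in the note above, that the two formulations agree:

```python
# Share of the panel making zero / exactly one category purchase (from the note above)
p_zero, p_one = 0.089, 0.132

# % of year-1 category buyers who made more than one category purchase, two ways
direct = (1 - p_zero - p_one) / (1 - p_zero)  # [>1 purchase] / [category buyers]
complement = 1 - p_one / (1 - p_zero)         # 1 - [exactly one purchase | category buyer]

assert abs(direct - complement) < 1e-9  # algebraically identical
print(round(direct, 3))  # 0.855
```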
How does the number of different brands purchased in the year vary as a function of the number of category purchases made during the year?
# number of category purchases made by each household and number of brands purchased by each household
brand_loyalty = (
trans_summary('Category', grocery_lf, 1)
.group_by('# of Purchases', 'Brands Purchased')
.agg(pl.col('panel_id').count().alias('Count'))
.collect()
.pivot(
on='Brands Purchased',
index='# of Purchases',
values='Count'
).sort(pl.col('# of Purchases'))
.select('# of Purchases', '1', '2', '3', '4')
)
(
GT(brand_loyalty, rowname_col='# of Purchases')
.tab_header(title='# of Purchase Occasions Vs. Brands Purchased ',
subtitle='Brands purchased as function of the number of category purchases made in year 1')
.tab_stubhead(label="# of Purchases")
.sub_missing(missing_text="")
.fmt_integer()
.data_color(
columns=['1', '2', '3', '4'],
domain=[0, 700],
palette=['white', 'rebeccapurple'],
na_color='white'
)
)
| # of Purchase Occasions Vs. Brands Purchased | ||||
|---|---|---|---|---|
| Brands purchased as function of the number of category purchases made in year 1 | ||||
| # of Purchases | 1 | 2 | 3 | 4 |
| 1 | 655 | 12 | ||
| 2 | 573 | 184 | 2 | |
| 3 | 516 | 216 | 24 | 1 |
| 4 | 370 | 193 | 47 | 2 |
| 5 | 278 | 189 | 43 | 4 |
| 6 | 196 | 124 | 53 | 8 |
| 7 | 121 | 93 | 28 | 5 |
| 8 | 87 | 94 | 25 | 3 |
| 9 | 51 | 51 | 22 | 3 |
| 10 | 42 | 34 | 9 | |
| 11 | 24 | 24 | 6 | 3 |
| 12 | 16 | 21 | 9 | 4 |
| 13 | 11 | 9 | 6 | 1 |
| 14 | 13 | 9 | 8 | 1 |
| 15 | 4 | 8 | 3 | |
| 16 | 3 | 6 | 4 | 3 |
| 17 | 1 | 3 | 3 | |
| 18 | 1 | 3 | ||
| 19 | 2 | |||
| 20 | 2 | 1 | 1 | |
| 22 | 1 | |||
| 25 | 1 | |||
| 27 | 1 | |||
Looking at the resulting table output, we see that there is quite a high level of sole-brand loyalty (i.e., only buying one brand), even as the number of category purchases increases. We have two panellists that made 20 category purchases, all with the same brand. That’s some level of loyalty!
We can compute the (weighted) average number of different brands purchased for each level of category purchasing.
wa_brands_purchased = (
trans_summary('Category', grocery_lf, 1)
.group_by('# of Purchases', 'Brands Purchased')
.agg(pl.col('panel_id').count().alias('Count'))
.with_columns((pl.col('Brands Purchased') * pl.col('Count')).alias('Product'))
.group_by('# of Purchases')
.agg(
(pl.col('Product').sum() / pl.col('Count').sum()) .alias('Weighted Average')
).sort(by='# of Purchases')
).collect()
(
GT(wa_brands_purchased, rowname_col='# of Purchases')
.tab_header(title='W.A. # of Different Brands Purchased')
.tab_stubhead(label="# of Purchases")
.fmt_number()
.data_color(
columns=['Weighted Average'],
domain=[1, 4],
palette=['white', 'rebeccapurple']
)
.cols_label({'Weighted Average': 'Weighted Average Brands'})
)
| W.A. # of Different Brands Purchased | |
|---|---|
| # of Purchases | Weighted Average Brands |
| 1 | 1.02 |
| 2 | 1.25 |
| 3 | 1.35 |
| 4 | 1.48 |
| 5 | 1.56 |
| 6 | 1.67 |
| 7 | 1.66 |
| 8 | 1.73 |
| 9 | 1.82 |
| 10 | 1.61 |
| 11 | 1.79 |
| 12 | 2.02 |
| 13 | 1.89 |
| 14 | 1.90 |
| 15 | 1.93 |
| 16 | 2.44 |
| 17 | 2.29 |
| 18 | 1.75 |
| 19 | 2.00 |
| 20 | 1.75 |
| 22 | 2.00 |
| 25 | 2.00 |
| 27 | 4.00 |
We note that the average number of brands purchased does increase as a function of category purchasing. This should not come as a surprise, as more category purchases means more opportunities to buy different brands. We saw earlier that two brands (Alpha and Bravo) had a combined value market share of 86%. As such, the relatively low number of different brands purchased in the year is not too surprising.
We now consider three common analyses designed to give insight into the nature of multibrand buying behaviour.
We know from our analysis of penetration and PPB earlier that 52% of households purchased Alpha and 51% of households purchased Bravo. What percentage of Alpha buyers also purchased Bravo during the year (and vice-versa)? The answer to these (and similar) questions is provided by a duplication of purchase analysis.
Before performing this analysis on our dataset, let us consider a toy example. The following table summarizes the purchasing of four brands by six households.
| A | B | C | D | |
|---|---|---|---|---|
| HH01 | 1 | 0 | 2 | 0 |
| HH02 | 0 | 1 | 0 | 0 |
| HH03 | 1 | 3 | 0 | 0 |
| HH04 | 0 | 0 | 1 | 4 |
| HH05 | 1 | 1 | 0 | 1 |
| HH06 | 0 | 0 | 0 | 1 |
We see that three households made at least one purchase of brand A, three households made at least one purchase of brand B, and so on. How many brand A buyers also purchased brand B? Two (HH03 and HH05). How many brand A buyers also purchased brand C? One (HH01). Repeating this for all brands gives us the following table, which we will call a duplication count table.
| A | B | C | D | |
|---|---|---|---|---|
| A | 3 | 2 | 1 | 1 |
| B | 2 | 3 | 0 | 1 |
| C | 1 | 0 | 2 | 1 |
| D | 1 | 1 | 1 | 3 |
For any given row, the number in each cell is the number of buyers of the brand associated with that row that also purchased the brand associated with that column. (The diagonal is obviously the number of buyers of each brand.) Looking at the row for brand A, we see that three households purchased that brand. Two of these households (67%) also made at least one purchase of brand B, and one of these three households (33%) also made at least one purchase of brand C. These row percentages are reported in the following table, which we call the duplication of purchase table. (By convention, we leave the diagonal blank.)
| A | B | C | D | |
|---|---|---|---|---|
| A | 67% | 33% | 33% | |
| B | 67% | 0% | 33% | |
| C | 50% | 0% | 50% | |
| D | 33% | 33% | 33% |
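Before turning to an efficient approach, the toy duplication counts above can be verified with a minimal brute-force sketch, counting households that bought both the row and column brand:

```python
import numpy as np

# Toy purchase counts from the first table (rows HH01..HH06, columns A..D)
purchases = np.array([
    [1, 0, 2, 0],
    [0, 1, 0, 0],
    [1, 3, 0, 0],
    [0, 0, 1, 4],
    [1, 1, 0, 1],
    [0, 0, 0, 1],
])
buyers = purchases > 0  # household-by-brand "ever buyer" indicators

# Naive double loop over brand pairs
n = purchases.shape[1]
dup_count = np.zeros((n, n), dtype=int)
for i in range(n):
    for j in range(n):
        dup_count[i, j] = int(np.sum(buyers[:, i] & buyers[:, j]))
print(dup_count)
```

This reproduces the duplication count table (diagonal = number of buyers of each brand); the matrix-multiplication approach described next gives the same result in one step.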
How can we create this table efficiently when we have a large number of panellists? One approach, which makes use of matrix multiplication, is as follows:
- We create what we will call an “ever buyers” matrix (or incidence matrix), which is a matrix of size (number of panellists) × (number of brands), where each cell takes on a value of 1 if the panellist (row) ever purchased the brand (column) in the period of interest; 0 otherwise.
| A | B | C | D | |
|---|---|---|---|---|
| HH01 | 1 | 0 | 1 | 0 |
| HH02 | 0 | 1 | 0 | 0 |
| HH03 | 1 | 1 | 0 | 0 |
| HH04 | 0 | 0 | 1 | 1 |
| HH05 | 1 | 1 | 0 | 1 |
| HH06 | 0 | 0 | 0 | 1 |
Transposed “ever buyer” matrix:
| HH01 | HH02 | HH03 | HH04 | HH05 | HH06 | |
|---|---|---|---|---|---|---|
| A | 1 | 0 | 1 | 0 | 1 | 0 |
| B | 0 | 1 | 1 | 0 | 1 | 0 |
| C | 1 | 0 | 0 | 1 | 0 | 0 |
| D | 0 | 0 | 0 | 1 | 1 | 1 |
Matrix Multiply By:
| A | B | C | D | |
|---|---|---|---|---|
| HH01 | 1 | 0 | 1 | 0 |
| HH02 | 0 | 1 | 0 | 0 |
| HH03 | 1 | 1 | 0 | 0 |
| HH04 | 0 | 0 | 1 | 1 |
| HH05 | 1 | 1 | 0 | 1 |
| HH06 | 0 | 0 | 0 | 1 |
Matrix Multiplication Visualizer
# m x n matrix where m = # of panellists, n = # of brands and value is number of purchases
brand_purch = np.array([
[1, 0, 2, 0],
[0, 1, 0, 0],
[1, 3, 0, 0],
[0, 0, 1, 4],
[1, 1, 0, 1],
[0, 0, 0, 1]
])
# m x n matrix where value is 1 if buyer ever purchased, 0 if never purchased
ever_buyers = np.where(brand_purch != 0, 1, 0)
# n x n square matrix representing count of row/column brand buyers who also purchased column/row brands
dup_count = ever_buyers.T @ ever_buyers
print('Duplication Count Matrix:')
print(dup_count, dup_count.shape)
print()
# Diagonal vector of dup_count matrix representing the number of buyers of each ROW brand
brand_buyers = np.diag(dup_count).reshape(4, 1)
print('Number of Buyers of Each Brand:')
print(brand_buyers, brand_buyers.shape)
# n x n square matrix representing the proportion of brand buyers for each ROW brand that also purchased other COLUMN brands
dup_purchase = dup_count / brand_buyers
print('Duplication of Purchase Matrix:')
print(dup_purchase, dup_purchase.shape)
Duplication Count Matrix:
[[3 2 1 1]
[2 3 0 1]
[1 0 2 1]
[1 1 1 3]] (4, 4)
Number of Buyers of Each Brand:
[[3]
[3]
[2]
[3]] (4, 1)
Duplication of Purchase Matrix:
[[1. 0.66666667 0.33333333 0.33333333]
[0.66666667 1. 0. 0.33333333]
[0.5 0. 1. 0.5 ]
[0.33333333 0.33333333 0.33333333 1. ]] (4, 4)
brand_purchases = trans_pivot(grocery_lf, 1).select('Alpha', 'Bravo', 'Charlie', 'Delta', 'Other')
# Step 1: Create ever buyers matrix
ever_buyers = brand_purchases.select(
[pl.when(pl.col(col) > 0).then(1).otherwise(0).alias(col) for col in brand_purchases.columns]
).to_numpy()
# Step 2: Compute duplication count matrix
dup_count = ever_buyers.T @ ever_buyers
brand_buyers = np.diag(dup_count).reshape(5, 1)
# Step 3: Compute duplication of purchase matrix
dup_purchase = dup_count / brand_buyers
# DataFrame conversion for tabular view
dup_count_df = (
pl.from_numpy(dup_count, schema=brand_purchases.columns)
.with_columns(pl.Series('Brands', brand_purchases.columns))
)
dup_purchase_df = (
pl.from_numpy(dup_purchase, schema=brand_purchases.columns)
.with_columns(pl.Series('Brands', brand_purchases.columns))
)
(
GT(dup_count_df, rowname_col='Brands')
.tab_header(title='Duplication Count Matrix',
subtitle='Count of buyers who purchased row AND column brands')
.tab_stubhead(label="Brands")
.sub_missing(missing_text="")
.fmt_integer()
.data_color(
columns=['Alpha', 'Bravo', 'Charlie', 'Delta', 'Other'],
domain=[0, 900],
palette=['white', 'rebeccapurple'],
)
.tab_style(
style=[
style.text(color="white"),
style.fill('lightgrey'),
],
locations=[
loc.body(['Alpha'],['Alpha']),
loc.body(['Bravo'],['Bravo']),
loc.body(['Charlie'],['Charlie']),
loc.body(['Delta'],['Delta']),
loc.body(['Other'],['Other'])
]
)
.tab_source_note(md("""**Read Row/Column as**: _# of People who Purchased [Brand] / Who also Purchased [Brand]_"""))
)
| Duplication Count Matrix | |||||
|---|---|---|---|---|---|
| Count of buyers who purchased row AND column brands | |||||
| Brands | Alpha | Bravo | Charlie | Delta | Other |
| Alpha | 2,624 | 896 | 403 | 239 | 69 |
| Bravo | 896 | 2,562 | 382 | 140 | 110 |
| Charlie | 403 | 382 | 813 | 116 | 26 |
| Delta | 239 | 140 | 116 | 380 | 10 |
| Other | 69 | 110 | 26 | 10 | 176 |
| Read Row/Column as: # of People who Purchased [Brand] / Who also Purchased [Brand] | |||||
(
GT(dup_purchase_df, rowname_col='Brands')
.tab_header(title='Duplication of Purchase Matrix',
subtitle='% of brand buyers that also purchased other brands')
.tab_stubhead(label="Brands")
.sub_missing(missing_text="")
.fmt_percent(decimals=0)
.data_color(
columns=['Alpha', 'Bravo', 'Charlie', 'Delta', 'Other'],
domain=[0, 0.7],
palette=['white', 'rebeccapurple'],
)
.tab_style(
style=[
style.text(color="white"),
style.fill('white'),
],
locations=[
loc.body(0,0),
loc.body(1,1),
loc.body(2,2),
loc.body(3,3),
loc.body(4,4)
]
)
.tab_source_note(md("""**Read Row/Column as**: _% of People who Purchased [Brand] / Who also Purchased [Brand]_"""))
)
| Duplication of Purchase Matrix | |||||
|---|---|---|---|---|---|
| % of brand buyers that also purchased other brands | |||||
| Brands | Alpha | Bravo | Charlie | Delta | Other |
| Alpha | 100% | 34% | 15% | 9% | 3% |
| Bravo | 35% | 100% | 15% | 5% | 4% |
| Charlie | 50% | 47% | 100% | 14% | 3% |
| Delta | 63% | 37% | 31% | 100% | 3% |
| Other | 39% | 62% | 15% | 6% | 100% |
| Read Row/Column as: % of People who Purchased [Brand] / Who also Purchased [Brand] | |||||
The duplication of purchase matrix for year 1 is computed above. We see that 34% of those panellists that purchased Alpha in year 1 also made at least one purchase of Bravo that year. We see that 15% of Alpha buyers also bought Charlie, whereas 50% of Charlie buyers also bought Alpha. This asymmetry is not surprising given the relative size of the two brands.
We have just seen that 69% of total category volume purchasing by the buyers of Alpha went to that brand. We know from the duplication of purchase analysis that 15% of Alpha buyers also purchased Charlie. How much of their category volume purchasing went to Charlie? This is answered via cross purchase analysis (sometimes called a combination purchase analysis).
To illustrate the logic of the associated calculations, let us revisit the toy problem introduced earlier. We have the following summary of the purchasing of four brands by six households and the associated “ever buyers” matrix. We will now assume the purchasing summary reports volume purchasing in kilograms (rather than transaction counts).
Brand Purchasing
| A | B | C | D | Total | |
|---|---|---|---|---|---|
| HH01 | 1 | 0 | 2 | 0 | 3 |
| HH02 | 0 | 1 | 0 | 0 | 1 |
| HH03 | 1 | 3 | 0 | 0 | 4 |
| HH04 | 0 | 0 | 1 | 4 | 5 |
| HH05 | 1 | 1 | 0 | 1 | 3 |
| HH06 | 0 | 0 | 0 | 1 | 1 |
Ever Buyers Matrix (Incidence Matrix)
| A | B | C | D | |
|---|---|---|---|---|
| HH01 | 1 | 0 | 1 | 0 |
| HH02 | 0 | 1 | 0 | 0 |
| HH03 | 1 | 1 | 0 | 0 |
| HH04 | 0 | 0 | 1 | 1 |
| HH05 | 1 | 1 | 0 | 1 |
| HH06 | 0 | 0 | 0 | 1 |
We see that buyers of brand A purchased 3 kg (1 + 1 + 1, from A column) of brand A and a total of 10 kg (3 + 4 + 3, from total column) in the category (i.e., SCR = 30%). We see that they also purchased 4 kg (3 + 1, from B column) of brand B, 2 kg (2, from C column) of brand C and 1 kg (1, from D column) of brand D. The associated numbers for all brands are given in the following table:
| A | B | C | D | |
|---|---|---|---|---|
| A | 3 | 4 | 2 | 1 |
| B | 2 | 5 | 0 | 1 |
| C | 1 | 0 | 3 | 4 |
| D | 1 | 1 | 1 | 6 |
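These numbers can be checked with a short sketch that, for each row brand, adds up the complete purchase vectors of that brand's buyers:

```python
import numpy as np

# Toy volume purchases in kg from the brand purchasing table (rows HH01..HH06, columns A..D)
volumes = np.array([
    [1, 0, 2, 0],
    [0, 1, 0, 0],
    [1, 3, 0, 0],
    [0, 0, 1, 4],
    [1, 1, 0, 1],
    [0, 0, 0, 1],
])
buyers = volumes > 0  # "ever buyers" indicators

# For each row brand, sum the full purchase rows of that brand's buyers
n = volumes.shape[1]
cross = np.zeros((n, n), dtype=int)
for i in range(n):
    cross[i] = volumes[buyers[:, i]].sum(axis=0)
print(cross)
```

Row A, for example, sums the purchase vectors of HH01, HH03, and HH05, giving 3 kg of A, 4 kg of B, 2 kg of C, and 1 kg of D, exactly as in the table above.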
The sum of the elements of each row gives us the total amount of category purchasing by buyers of the brand of that row. Dividing each row entry by the sum of that row’s elements gives us the following cross purchasing table, the diagonal of which is obviously SCR.
| A | B | C | D | |
|---|---|---|---|---|
| A | 30% | 40% | 20% | 10% |
| B | 25% | 63% | 0% | 13% |
| C | 13% | 0% | 38% | 50% |
| D | 11% | 11% | 11% | 67% |
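The row normalisation that produces this percentage table can be sketched directly in NumPy; the diagonal recovers each brand's share of category requirements (SCR):

```python
import numpy as np

# Cross purchasing counts from the toy table above (rows/columns A..D)
cross = np.array([
    [3, 4, 2, 1],
    [2, 5, 0, 1],
    [1, 0, 3, 4],
    [1, 1, 1, 6],
])

# Divide each row by its total (total category purchasing by that brand's buyers)
row_share = cross / cross.sum(axis=1, keepdims=True)

# Diagonal = share of category requirements (SCR) for each brand
scr = np.diag(row_share)
print(np.round(row_share, 2))
```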
How do we create the table efficiently when we have a large number of panellists? One approach is to pre-multiply the panellist × brand volume purchasing summary table by the transpose of the “ever buyers” matrix \(A^{T}\cdot B\):
Transposed “ever buyer” matrix:
| HH01 | HH02 | HH03 | HH04 | HH05 | HH06 | |
|---|---|---|---|---|---|---|
| A | 1 | 0 | 1 | 0 | 1 | 0 |
| B | 0 | 1 | 1 | 0 | 1 | 0 |
| C | 1 | 0 | 0 | 1 | 0 | 0 |
| D | 0 | 0 | 0 | 1 | 1 | 1 |
Multiplied by Brand Volume Purchasing:
| A | B | C | D | |
|---|---|---|---|---|
| HH01 | 1 | 0 | 2 | 0 |
| HH02 | 0 | 1 | 0 | 0 |
| HH03 | 1 | 3 | 0 | 0 |
| HH04 | 0 | 0 | 1 | 4 |
| HH05 | 1 | 1 | 0 | 1 |
| HH06 | 0 | 0 | 0 | 1 |
The “ever buyers” matrix encodes each panellist as a buyer (1) or non-buyer (0) of each brand, capturing the overlaps in who buys what. When we pre-multiply the panellist × brand volume matrix by the transpose of this binary matrix, each row of the resulting n × n matrix accumulates the complete purchase vectors of that row brand’s buyers: the total volume they purchased of the row brand itself and of every other brand.
# m x n matrix where m = # of panellists, n = # of brands and value is volume of purchases
brand_purch = np.array([
[1, 0, 2, 0],
[0, 1, 0, 0],
[1, 3, 0, 0],
[0, 0, 1, 4],
[1, 1, 0, 1],
[0, 0, 0, 1]
])
# m x n matrix where value is 1 if buyer ever purchased, 0 if never purchased
ever_buyers = np.where(brand_purch != 0, 1, 0)
# n x n square matrix representing row brand buyers and their volume/unit purchases from row brand and column brands
cross_purch_matrix = ever_buyers.T @ brand_purch
print('Cross Purchasing Matrix:')
print(cross_purch_matrix, cross_purch_matrix.shape)
print()
# The sum of the elements of each row gives us the total amount of category purchasing (volume) by buyers of the brand of that row
category_purch = cross_purch_matrix.sum(axis=1).reshape(4, 1)
print('Total Amount of Category Purchasing by Buyers of the Row Brand:')
print(category_purch, category_purch.shape)
print()
# n x n square matrix representing the proportion of volume purchased by row brand buyers from each COLUMN brand, relative to the total category volume purchased by the row brand's buyers
cross_purch = cross_purch_matrix / category_purch
print('Cross Purchasing Matrix as a % of Total Volume Purchased of Each Brand:')
print(cross_purch, cross_purch.shape)
Cross Purchasing Matrix:
[[3 4 2 1]
[2 5 0 1]
[1 0 3 4]
[1 1 1 6]] (4, 4)
Total Amount of Category Purchasing by Buyers of the Row Brand:
[[10]
[ 8]
[ 8]
[ 9]] (4, 1)
Cross Purchasing Matrix as a % of Total Volume Purchased of Each Brand:
[[0.3 0.4 0.2 0.1 ]
[0.25 0.625 0. 0.125 ]
[0.125 0. 0.375 0.5 ]
[0.11111111 0.11111111 0.11111111 0.66666667]] (4, 4)
The cross purchasing analysis for year 1 is computed and reported below. We see that for those panellists that purchased Alpha at least once in year 1, 69% of their category volume purchasing went to Alpha, 18% went to Bravo, 8% to Charlie, and so on.
brand_vol_purchases = vol_pivot(grocery_lf, 1).fill_null(0)
# Step 1: Create ever buyers matrix
ever_buyers = brand_vol_purchases.select(
[pl.when(pl.col(col) > 0).then(1).otherwise(0).alias(col) for col in brand_vol_purchases.columns]
).to_numpy()
# Step 2: Compute cross purchase matrix
cross_purch_matrix = ever_buyers.T @ brand_vol_purchases
total_volume_purchases = np.sum(cross_purch_matrix, axis=1).reshape(5, 1)
# Step 3: Compute cross purchase matrix as a % of total volume purchased of each brand
cross_purch = cross_purch_matrix / total_volume_purchases
# DataFrame conversion for tabular view
cross_purch_matrix_df = (
pl.from_numpy(cross_purch_matrix, schema=brand_vol_purchases.columns)
.with_columns(pl.Series('Brands', brand_vol_purchases.columns))
)
cross_purch_df = (
pl.from_numpy(cross_purch, schema=brand_vol_purchases.columns)
.with_columns(pl.Series('Brands', brand_vol_purchases.columns))
)
(
GT(cross_purch_matrix_df, rowname_col='Brands')
.tab_header(title='Cross Purchasing Analysis',
subtitle='Total Volume of Category Purchasing by Buyers of the Brand (by row)')
.tab_stubhead(label="Brands")
.sub_missing(missing_text="")
.fmt_integer()
.data_color(
columns=['Alpha', 'Bravo', 'Charlie', 'Delta', 'Other'],
domain=[0, 3000],
palette=['white', 'rebeccapurple'],
)
.tab_style(
style=[
style.text(color="white"),
style.fill('lightgrey'),
],
locations=[
loc.body(0,0),
loc.body(1,1),
loc.body(2,2),
loc.body(3,3),
loc.body(4,4)
]
)
.tab_source_note(md("""**Read Row/Column as**: _Purchasers of [Brands] / Total Volume Purchased of [Brands]_"""))
)
| Cross Purchasing Analysis | |||||
|---|---|---|---|---|---|
| Total Volume of Category Purchasing by Buyers of the Brand (by row) | |||||
| Brands | Alpha | Bravo | Charlie | Delta | Other |
| Alpha | 9,166 | 2,461 | 1,076 | 504 | 108 |
| Bravo | 2,522 | 8,240 | 907 | 282 | 170 |
| Charlie | 1,221 | 1,131 | 2,171 | 220 | 36 |
| Delta | 701 | 311 | 332 | 921 | 13 |
| Other | 223 | 398 | 53 | 20 | 286 |
| Read Row/Column as: Purchasers of [Brands] / Total Volume Purchased of [Brands] | |||||
(
GT(cross_purch_df, rowname_col='Brands')
.tab_header(title='Cross Purchasing Analysis',
subtitle='% of Total Volume of Category Purchasing by Buyers of the Brand (by row)')
.tab_stubhead(label="Brands")
.sub_missing(missing_text="")
.fmt_percent(decimals=0)
.data_color(
columns=['Alpha', 'Bravo', 'Charlie', 'Delta', 'Other'],
domain=[0, 0.7],
palette=['white', 'rebeccapurple'],
)
.tab_source_note(md("""**Read Row/Column as**: _Purchasers of [Brands] / % Total Volume Purchased of [Brands]_"""))
)
| Cross Purchasing Analysis | |||||
|---|---|---|---|---|---|
| % of Total Volume of Category Purchasing by Buyers of the Brand (by row) | |||||
| Brands | Alpha | Bravo | Charlie | Delta | Other |
| Alpha | 69% | 18% | 8% | 4% | 1% |
| Bravo | 21% | 68% | 7% | 2% | 1% |
| Charlie | 26% | 24% | 45% | 5% | 1% |
| Delta | 31% | 14% | 15% | 40% | 1% |
| Other | 23% | 41% | 5% | 2% | 29% |
| Read Row/Column as: Purchasers of [Brands] / % Total Volume Purchased of [Brands] | |||||
For a given brand, we can plot the associated row entries as a pie chart, for example, see the Importance of competition plot, which plots the percentage of total volume of category purchasing by buyers of Alpha.
cross_purch_plot = (
pl.from_numpy(cross_purch.T, schema=brand_vol_purchases.columns, orient='row')
.with_columns(pl.Series('Brands', brand_vol_purchases.columns))
)
base = alt.Chart(cross_purch_plot).encode(
alt.Theta("Alpha:Q").stack(True),
alt.Color("Brands:N")
).properties(
width=600,
title={
'text': 'Importance of Competition to Buyers of Alpha',
'subtitle': '% of Total Volume of Category Purchasing by Buyers of Alpha'},
)
pie = base.mark_arc(outerRadius=145)
text = base.mark_text(radius=165, size=15).encode(
text=alt.Text("Alpha:Q", format=".0%"))
pie + text
As previously noted when we computed SCR, we see that Alpha accounts for 69% of category purchasing by the buyers of Alpha. From the cross-purchasing analysis, we see that Bravo accounts for 18% of their category purchasing. Is this large or small? One way of answering this question is to compare actual purchasing against expectation, given general purchasing patterns in the category.
# Share of Category Purchasing - Alpha Brand Selected
share_of_cat_purch = cross_purch[0]
# Market Share - Volume
market_share = (
brand_vol_purchases.sum() / # sum all columns -> returns a single row dataframe
    brand_vol_purchases.sum_horizontal().sum() # sum of total category volume sales -> returns a scalar value
).to_numpy().reshape(5)
# Share of residual purchasing -- Alpha Buyers
# The percentage of the category purchasing not accounted for by Alpha that goes to each of the other brands
residual_purch_brand = share_of_cat_purch / (1 - share_of_cat_purch[0])
# Share of residual purchasing -- Category Buyers
# Residual share of category purchasing (across all category buyers) once Alpha is removed
# when we exclude Alpha, what percentage of (the remaining) category purchasing goes to each of the other brands)
residual_purch_category = market_share / (1 - market_share[0])
# Index against expectation
index_against_expect = 100 * residual_purch_brand/residual_purch_category
# Importance Against Expectations - Create DataFrame
stack = np.vstack((
share_of_cat_purch,
market_share,
residual_purch_brand,
residual_purch_category,
index_against_expect))
cols = pl.DataFrame({'Col': ['Share of Category Purchasing',
'Volume Market Share',
'Share of Residual Purchasing - Alpha Buyers',
'Share of Residual Purchasing - Category Buyers',
'Index Against Expectation']})
rows = cross_purch_df.columns[:-1]
importance_against_expect = pl.from_numpy(stack, schema=rows).hstack(columns=cols)(
GT(importance_against_expect, rowname_col='Col')
.tab_header(title='Importance Against Expectation',
subtitle='Compare Actual Purchasing against Expectation Given General Purchasing Patterns')
.fmt_percent(rows=list(range(4)), decimals=0)
.fmt_integer(rows=4)
.data_color(
columns=['Alpha', 'Bravo', 'Charlie', 'Delta', 'Other'],
rows=4,
domain=[80, 170],
palette=['white', 'rebeccapurple'],
)
.tab_style(
style=[
style.text(color="white"),
style.fill('white'),
],
locations=[loc.body(0,[2, 3, 4])]
)
)| Importance Against Expectation | |||||
|---|---|---|---|---|---|
| Compare Actual Purchasing against Expectation Given General Purchasing Patterns | |||||
| Alpha | Bravo | Charlie | Delta | Other | |
| Share of Category Purchasing | 69% | 18% | 8% | 4% | 1% |
| Volume Market Share | 44% | 40% | 10% | 4% | 1% |
| Share of Residual Purchasing - Alpha Buyers | 221% | 59% | 26% | 12% | 3% |
| Share of Residual Purchasing - Category Buyers | 79% | 71% | 19% | 8% | 2% |
| Index Against Expectation | 280 | 84 | 139 | 153 | 105 |
index_ae = pl.DataFrame({
'Index Against Expectation - By Volume Purchase': index_against_expect[1:],
'Brands': cross_purch_df.columns[1:-1]})
# Create the Altair chart
alt.Chart(index_ae).mark_bar().encode(
x=alt.X("Index Against Expectation - By Volume Purchase:Q"),
y=alt.Y("Brands:N"),
).properties(
width=650,
height=250,
title={"text": 'Importance Against Expectation'},
)
Let us consider Charlie. Looking at Charlie’s share of residual purchasing amongst Alpha buyers, we see that it accounted for 26% of the category purchasing by Alpha buyers that did not go to Alpha. If the purchasing of Alpha buyers were consistent with overall market patterns (as reflected in the volume market shares), we would expect Charlie to account for 19% of their purchasing. We can therefore say that Charlie’s share of purchasing amongst the buyers of Alpha is above expectation (when expectation is based on overall patterns of buying behaviour).
The Index Against Expectation captures the extent to which the volume purchases of Alpha’s buyers go to each competitor above or below what we would expect from the patterns observed in the overall market.
We see that, relative to market share, Bravo is less of a threat to Alpha than we would expect (index = 84). Relative to market share, Charlie and Delta are purchased more by buyers of Alpha than we would expect.
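The index arithmetic can be sketched with the rounded shares from the tables above. Note these are rounded values, so the resulting indices only approximate the tabulated 84 and 139:

```python
import numpy as np

# Rounded shares from the volume-based tables (illustrative, not exact)
share_of_cat_purch = np.array([0.69, 0.18, 0.08, 0.04, 0.01])  # Alpha buyers' shares of category volume
market_share = np.array([0.44, 0.40, 0.10, 0.05, 0.01])        # overall volume market shares

# Share of residual purchasing: remove Alpha and renormalise over what remains
residual_brand = share_of_cat_purch / (1 - share_of_cat_purch[0])
residual_category = market_share / (1 - market_share[0])

# Index = 100 x (actual residual share) / (expected residual share)
index_against_expect = 100 * residual_brand / residual_category
# Bravo indexes below 100 (less of a threat than expected); Charlie and Delta index above 100
```

An index above 100 means the brand takes a larger share of Alpha buyers’ residual purchasing than its market share would suggest; below 100, a smaller share.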
We can repeat these analyses using spend rather than volume purchasing.
We see, for example, that buyers of Alpha spent 71% of their category spend on Alpha. This is in contrast to the 69% of their category volume requirements satisfied by Alpha.
brand_spend_purchases = spend_pivot(grocery_lf, 1).fill_null(0)
# Step 1: Create ever buyers matrix
ever_buyers = brand_spend_purchases.select(
    [pl.when(pl.col(col) > 0).then(1).otherwise(0).alias(col) for col in brand_spend_purchases.columns]
).to_numpy()
# Step 2: Compute cross purchase matrix
cross_purch_matrix = ever_buyers.T @ brand_spend_purchases
total_spend_purchases = np.sum(cross_purch_matrix, axis=1).reshape(5, 1)
# Step 3: Compute cross purchase matrix as a % of total spend of each brand
cross_purch = cross_purch_matrix / total_spend_purchases
# DataFrame conversion for tabular view
cross_purch_matrix_df = (
pl.from_numpy(cross_purch_matrix, schema=brand_spend_purchases.columns)
.with_columns(pl.Series('Brands', brand_spend_purchases.columns))
)
cross_purch_df = (
pl.from_numpy(cross_purch, schema=brand_spend_purchases.columns)
.with_columns(pl.Series('Brands', brand_spend_purchases.columns))
)(
GT(cross_purch_matrix_df, rowname_col='Brands')
.tab_header(title='Cross Purchasing Analysis',
subtitle='Total Category Spend by Buyers of the Brand (by row)')
.tab_stubhead(label="Brands")
.sub_missing(missing_text="")
.fmt_integer()
.data_color(
columns=['Alpha', 'Bravo', 'Charlie', 'Delta', 'Other'],
domain=[0, 10_000],
palette=['white', 'rebeccapurple'],
)
.tab_style(
style=[
style.text(color="white"),
style.fill('lightgrey'),
],
locations=[
loc.body(0,0),
loc.body(1,1),
loc.body(2,2),
loc.body(3,3),
loc.body(4,4)
]
)
.tab_source_note(md("""**Read Row/Column as**: _Purchasers of [Brands] / Total Spend on [Brands]_"""))
)| Cross Purchasing Analysis | |||||
|---|---|---|---|---|---|
| Total Category Spend by Buyers of the Brand (by row) | |||||
| Brands | Alpha | Bravo | Charlie | Delta | Other |
| Alpha | 33,571 | 8,913 | 2,606 | 1,807 | 571 |
| Bravo | 9,292 | 28,603 | 2,076 | 1,008 | 1,022 |
| Charlie | 4,481 | 3,854 | 5,121 | 793 | 129 |
| Delta | 2,597 | 1,149 | 806 | 3,272 | 57 |
| Other | 825 | 1,337 | 118 | 74 | 1,535 |
| Read Row/Column as: Purchasers of [Brands] / Total Spend on [Brands] | |||||
(
GT(cross_purch_df, rowname_col='Brands')
.tab_header(title='Cross Purchasing Analysis',
subtitle='% of Total Category Spend by Buyers of the Brand (by row)')
.tab_stubhead(label="Brands")
.sub_missing(missing_text="")
.fmt_percent(decimals=0)
.data_color(
columns=['Alpha', 'Bravo', 'Charlie', 'Delta', 'Other'],
domain=[0, 0.8],
palette=['white', 'rebeccapurple'],
)
.tab_source_note(md("""**Read Row/Column as**: _Purchasers of [Brands] / % Total Spend on [Brands]_"""))
)| Cross Purchasing Analysis | |||||
|---|---|---|---|---|---|
| % of Total Category Spend by Buyers of the Brand (by row) | |||||
| Brands | Alpha | Bravo | Charlie | Delta | Other |
| Alpha | 71% | 19% | 5% | 4% | 1% |
| Bravo | 22% | 68% | 5% | 2% | 2% |
| Charlie | 31% | 27% | 36% | 6% | 1% |
| Delta | 33% | 15% | 10% | 42% | 1% |
| Other | 21% | 34% | 3% | 2% | 39% |
| Read Row/Column as: Purchasers of [Brands] / % Total Spend on [Brands] | |||||
We can create a spend-based importance against expectation plot, using value market share as the reference.
# Share of Category Purchasing - Alpha Brand Selected
share_of_cat_purch = cross_purch[0]
# Market Share - Value (spend)
market_share = (
    brand_spend_purchases.sum() / # sum all columns -> returns a single row dataframe
    brand_spend_purchases.sum_horizontal().sum() # sum of total category spend -> returns a scalar value
).to_numpy().reshape(5)
# Share of residual purchasing -- Alpha Buyers
# The percentage of the category purchasing not accounted for by Alpha that goes to each of the other brands
residual_purch_brand = share_of_cat_purch / (1 - share_of_cat_purch[0])
# Share of residual purchasing -- Category Buyers
# Residual share of category purchasing (across all category buyers) once Alpha is removed
# when we exclude Alpha, what percentage of (the remaining) category purchasing goes to each of the other brands)
residual_purch_category = market_share / (1 - market_share[0])
# Index against expectation
index_against_expect = 100 * residual_purch_brand/residual_purch_category
# Importance Against Expectations - Create DataFrame
stack = np.vstack((
share_of_cat_purch,
market_share,
residual_purch_brand,
residual_purch_category,
index_against_expect))
cols = pl.DataFrame({'Col': ['Share of Category Purchasing',
'Value Market Share',
'Share of Residual Purchasing - Alpha Buyers',
'Share of Residual Purchasing - Category Buyers',
'Index Against Expectation']})
rows = cross_purch_df.columns[:-1]
importance_against_expect = pl.from_numpy(stack, schema=rows).hstack(columns=cols)(
GT(importance_against_expect, rowname_col='Col')
.tab_header(title='Importance Against Expectation',
subtitle='Compare Actual Purchasing against Expectation Given General Purchasing Patterns')
.fmt_percent(rows=list(range(4)), decimals=0)
.fmt_integer(rows=4)
.data_color(
columns=['Alpha', 'Bravo', 'Charlie', 'Delta', 'Other'],
rows=4,
domain=[80, 170],
palette=['white', 'rebeccapurple'],
)
.tab_style(
style=[
style.text(color="white"),
style.fill('white'),
],
locations=[loc.body(0,[2, 3, 4])]
)
)| Importance Against Expectation | |||||
|---|---|---|---|---|---|
| Compare Actual Purchasing against Expectation Given General Purchasing Patterns | |||||
| Alpha | Bravo | Charlie | Delta | Other | |
| Share of Category Purchasing | 71% | 19% | 5% | 4% | 1% |
| Value Market Share | 47% | 40% | 7% | 5% | 2% |
| Share of Residual Purchasing - Alpha Buyers | 242% | 64% | 19% | 13% | 4% |
| Share of Residual Purchasing - Category Buyers | 87% | 74% | 13% | 8% | 4% |
| Index Against Expectation | 277 | 86 | 141 | 153 | 103 |
index_ae = pl.DataFrame({
'Index Against Expectation - By Spend': index_against_expect[1:],
'Brands': cross_purch_df.columns[1:-1]})
# Create the Altair chart
alt.Chart(index_ae).mark_bar().encode(
x=alt.X("Index Against Expectation - By Spend:Q"),
y=alt.Y("Brands:N"),
).properties(
width=650,
height=250,
title={"text": 'Importance Against Expectation'}
)
Garbage Collect
exceptions = [
'grocery_lf',
'sku_lf',
'kiwi_lf',
'In',
'exceptions',
'active_variables',
'penetration',
'ppb',
'panel_size']
active_variables = [
var for var, value in globals().items()
if not var.startswith('_') # Exclude variables that start with "_"
and var not in exceptions # Exclude variables in the exceptions list
and isinstance(value, (pl.LazyFrame, pl.DataFrame, pl.Series, alt.Chart, alt.LayerChart, list, int, float, str, np.ndarray, np.int64, np.float32)) # Remove these types only
]
for var in active_variables:
del globals()[var]
del active_variables, exceptions, var
gc.collect()387
Up to now, we have been characterising buyer behaviour in a given time period (i.e., one year), be it focusing on one brand or multiple brands. We now consider some standard analyses that give insight into the dynamics of buyer behaviour from period to period. We first consider the case of established products and then turn our attention to the analysis of new product buying behaviour.
We are interested in summarising how buyer behaviour varies across consecutive periods. We first consider how temporal variations in total sales can be understood by decomposing total sales. Next we explore temporal variations in customer-level purchasing by examining how the distribution of purchasing in one period varies as a function of the level of purchasing in the previous period. Finally, we consider a summary measure of period-to-period purchasing called the repeat rate.
Most firms have systems that will report sales over time. As we try to make sense of any observed changes, it is helpful to note a fundamental (multiplicative) sales decomposition. For any time period,
\[ \begin{aligned} \text{Sales} ={}& \text{# households (HHs) in the country} \\ &\times \text{proportion of HHs buying the brand (penetration)} \\ &\times \text{# purchase occasions per buyer (PPB)} \\ &\times \text{# packs per purchase} \\ &\times \text{weight or price per pack} \end{aligned} \]
There is nothing magical about this specific decomposition. We can create variations on a theme that are more relevant for the specific analysis setting at hand. For example, suppose we are doing an analysis at the brand level, where the SKUs associated with the brand come in different sizes. Furthermore, suppose the time period is sufficiently small that households make only one purchase per period, if at all. (In other words, PPB = 1). A more relevant decomposition would be
\[ \begin{aligned} \text{Sales (\$)} ={}& \text{# households (HHs) in the country} \\ &\times \text{proportion of HHs buying the brand (penetration)} \\ &\times \text{average volume per purchase} \\ &\times \text{average price per unit of volume} \end{aligned} \]
The product of the last two quantities is often called average order value (AOV).
\[ AOV = \text{average volume per purchase} \times \text{average price per unit of volume} \]
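As a minimal sketch of this decomposition, with purely hypothetical numbers (none of these figures come from the panel data):

```python
# Hypothetical weekly figures, purely for illustration
num_households = 5000          # HHs in the country/panel
penetration = 0.04             # proportion of HHs buying the brand this week
avg_vol_per_purchase = 1.0     # kg per purchase (PPB = 1, so purchases = buyers)
avg_price_per_kg = 3.70        # $ per kg

# Average order value is the product of the last two components
aov = avg_vol_per_purchase * avg_price_per_kg

# Sales follow the multiplicative decomposition
sales = num_households * penetration * avg_vol_per_purchase * avg_price_per_kg
# 5000 x 0.04 = 200 buyers, each spending $3.70 -> $740 of sales
```

The point of the decomposition is that any change in sales must flow through one or more of these components, which is what we exploit below.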
Recall the plot of Alpha’s revenue we created earlier:
weekly_plot(dataframe=weekly_spend_summary('Alpha', grocery_lf).collect(),
y='Weekly Spend',
title='Plot of weekly revenue for Alpha',
y_axis_label='Spend ($)',
pct=False,
legend=False)
We observe some weeks where there is a massive increase in revenue. How much of this is due to an increase in penetration versus, say, buyers simply buying more product on a given purchase occasion? Let us explore this using the second decomposition given above.
We group by ‘week’ and aggregate to get the number of unique ‘trans_id’ and ‘panel_id’ per week. If ‘num_trans’ and ‘num_buyers’ are the same for each week, then no panellist made more than one purchase in any given week, which means PPB = 1. We also, for each ‘week’, compute volume and spend.
(
grocery_lf
.filter(pl.col('brand') == 'Alpha')
.group_by("week")
.agg(
pl.col("trans_id").n_unique().alias("num_trans"),
pl.col("panel_id").n_unique().alias("num_buyers")
).with_columns(
(pl.col('num_trans') / pl.col('num_buyers')).alias('ppb')
)
.sort('week')
.filter(pl.col('ppb') == 1)
).collect()| week | num_trans | num_buyers | ppb |
|---|---|---|---|
| u16 | u32 | u32 | f64 |
| 1 | 93 | 93 | 1.0 |
| 2 | 166 | 166 | 1.0 |
| 3 | 143 | 143 | 1.0 |
| 4 | 304 | 304 | 1.0 |
| 5 | 107 | 107 | 1.0 |
| … | … | … | … |
| 100 | 206 | 206 | 1.0 |
| 101 | 75 | 75 | 1.0 |
| 102 | 105 | 105 | 1.0 |
| 103 | 151 | 151 | 1.0 |
| 104 | 114 | 114 | 1.0 |
We note that the number of transactions (num_trans) associated with each week equals the number of panellists (num_buyers) associated with each week. In other words, no panellist made more than one purchase in any given week, which means PPB = 1 (num_trans / num_buyers), as assumed above.
# Filter for rows where the brand is "Alpha"
alpha_lf = grocery_lf.filter(pl.col("brand") == "Alpha")
# Aggregating to get the number of unique trans_id and panel_id per week
alpha_weekly_trans = (
alpha_lf
.group_by("week")
.agg(
pl.col("trans_id").n_unique().alias("num_trans"),
pl.col("panel_id").n_unique().alias("num_buyers")
)
)
# Aggregating to get the sum of spend and volume per week
alpha_weekly_spend_vol = (
alpha_lf
.with_columns(((pl.col('units') * pl.col('price'))).alias('spend'))
.join(other=sku_lf, on='sku_id')
.with_columns((((pl.col('units') * pl.col('weight'))/1000)).alias('volume'))
.group_by('week').agg(
pl.col("spend").sum().alias("spend"),
pl.col("volume").sum().alias("volume")
)
)
# The alpha_weekly_trans & alpha_weekly_spend_vol are joined.
# alpha_weekly_summary now contains week, num_trans, num_buyers, spend, and volume
alpha_weekly_summary = alpha_weekly_trans.join(alpha_weekly_spend_vol, on="week").sort('week')
alpha_weekly = (
alpha_weekly_summary
.with_columns(
# Weekly penetration = num_buyer / panel_sizes
(pl.col('num_buyers') / panel_size).alias('penet'),
# PPB = num_trans / num_buyer
(pl.col('num_trans') / pl.col('num_buyers')).alias('ppb'),
# Average Order Value = spend / num_trans
(pl.col('spend') / pl.col('num_trans')).alias('aov'),
        # Average Order Volume = volume / num_trans
        (pl.col('volume') / pl.col('num_trans')).alias('aovol'),
# Average price per unit volume = spend / volume
(pl.col('spend') / pl.col('volume')).alias('avg_price_kg')
).select(
'week', 'spend', 'penet', 'aov', 'aovol', 'avg_price_kg'
)
)
Now we compute the correlations between weekly revenue and the components of its (multiplicative) decomposition across the two years.
Correlations between Weekly Revenue & its Components - Over 2 Years
corr_matrix = alpha_weekly.select(pl.all().exclude('week')).collect().corr()
relabel = pl.Series(['Revenue', 'Penetration', 'AOV', 'AOVOL', 'Price/Unit'])
corr_matrix = corr_matrix.with_columns(pl.Series(relabel).alias("Decomposition"))
(
GT(corr_matrix, rowname_col='Decomposition')
.tab_header(title='Correlation Matrix of Revenue and the Components of its Decomposition',
subtitle='Correlations between weekly revenue and the components of its (multiplicative) decomposition')
.cols_label(
spend=relabel[0],
penet=relabel[1],
aov=relabel[2],
aovol=relabel[3],
avg_price_kg=relabel[4]
)
.cols_align(align='center')
.fmt_number(decimals=3)
.data_color(
columns=['spend', 'penet', 'aov', 'aovol', 'avg_price_kg'],
domain=[-1, 1],
palette=['orange', 'white', 'rebeccapurple'],
)
.tab_style(
style=[
style.text(color="white"),
style.fill('lightgrey'),
],
locations=[
loc.body(0,0),
loc.body(1,1),
loc.body(2,2),
loc.body(3,3),
loc.body(4,4)
]
)
)| Correlation Matrix of Revenue and the Components of its Decomposition | |||||
|---|---|---|---|---|---|
| Correlations between weekly revenue and the components of its (multiplicative) decomposition | |||||
| Revenue | Penetration | AOV | AOVOL | Price/Unit | |
| Revenue | 1.000 | 0.982 | 0.542 | 0.758 | −0.590 |
| Penetration | 0.982 | 1.000 | 0.387 | 0.703 | −0.654 |
| AOV | 0.542 | 0.387 | 1.000 | 0.691 | −0.073 |
| AOVOL | 0.758 | 0.703 | 0.691 | 1.000 | −0.754 |
| Price/Unit | −0.590 | −0.654 | −0.073 | −0.754 | 1.000 |
Below is a plot of the penetration numbers. Comparing this with Alpha’s weekly revenue plot over the two years, we see that the fluctuations in revenue go hand-in-hand with fluctuations in penetration. To make this comparison clearer, we overlay the two series of numbers as well. The correlation between these two quantities is 0.98. It would appear that the key driver of revenue increases is simply more people buying the brand that week.
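As a sanity check on how to read such a correlation, here is a synthetic sketch (entirely made-up series, not the panel data): when revenue is generated as panel size × penetration × AOV and penetration is the dominant source of week-to-week variation, the revenue-penetration correlation comes out high, much like the 0.98 we observe.

```python
import numpy as np

rng = np.random.default_rng(42)
weeks = 104
penet = 0.03 + 0.02 * rng.random(weeks)    # hypothetical weekly penetration, varying a lot
aov = 3.7 + 0.2 * rng.normal(size=weeks)   # hypothetical AOV, varying comparatively little
revenue = 5000 * penet * aov               # revenue via the multiplicative decomposition

# Because penetration drives most of the variation, this correlation is close to 1
corr = np.corrcoef(revenue, penet)[0, 1]
```

If instead AOV were the volatile component, the same calculation would attribute the revenue swings to order value rather than to more buyers.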
weekly_plot(dataframe=alpha_weekly.select('week', 'penet').collect(),
y='penet',
title='Alpha - Weekly Penetration',
y_axis_label='Penetration',
pct=True,
legend=False)
# Method 1: Fold transformation - convert wide-form data to long-form directly, without any preprocessing
source = alpha_weekly.select('week', 'spend', 'penet').collect()
base = alt.Chart(source).mark_line().transform_fold(
    ['spend', 'penet'],  # fold the actual column names present in the data
    as_=['Legend', 'Value']
).encode(
    alt.Color('Legend:N'),
    alt.X('week', axis=alt.Axis(
        values=np.arange(0, 104 + 1, 13), # Explicitly specify quarter-end weeks
        labelExpr="datum.value", # Show only these labels
        title='Week'))
)
spend_line = base.transform_filter(
    alt.datum.Legend == 'spend'
).encode(
    y = alt.Y(
        'spend:Q', # Q = a continuous real-valued quantity
        axis=alt.Axis(format="$,.0f")).title('Revenue ($)')
)
penet_line = base.transform_filter(
    alt.datum.Legend == 'penet'
).encode(
    y = alt.Y(
        'penet:Q', # Q = a continuous real-valued quantity
        axis=alt.Axis(format=",.0%")).title('Penetration')
)
alt.layer(spend_line, penet_line).resolve_scale(
y='independent'
).properties(
width=600,
height=250,
title='Weekly Revenue & Penetration for Alpha'
).configure_view(stroke=None).configure_axisY(grid=True)
# Method 2: Pre-processed - wide-form to long-form conversion
source = alpha_weekly.select('week', 'spend', 'penet').unpivot(
index='week',
on=['spend', 'penet']
).collect()
base = alt.Chart(source).mark_line().encode(
alt.Color('variable:N'),
alt.X('week', axis=alt.Axis(
values=np.arange(0, 104 + 1, 13), # Explicitly specify quarter-end weeks
labelExpr="datum.value", # Show only these labels
title='Week'))
)
spend_line = base.transform_filter(
alt.datum.variable == 'spend'
).encode(
y = alt.Y(
'value:Q', # Q = a continuous real-valued quantity
axis=alt.Axis(format="$,.0f")).title('Revenue ($)')
)
penet_line = base.transform_filter(
alt.datum.variable == 'penet'
).encode(
y = alt.Y(
'value:Q', # Q = a continuous real-valued quantity
axis=alt.Axis(format=",.0%")).title('Penetration')
)
alt.layer(spend_line, penet_line).resolve_scale(
y='independent'
).properties(
width=600,
height=250,
title='Weekly Revenue & Penetration for Alpha'
).configure_view(stroke=None).configure_axisY(grid=True)
However, the lack of a perfect correlation means there is some variability in average order value that is not highly correlated with penetration. We plot this quantity below. (The correlation between penetration and average order value is 0.39.) In order to get a sense of what lies behind the variability in average order value, we also plot weekly average order volume and average price per kg, respectively.
def weekly_plot(dataframe, y, year=2, title="", y_axis_label="", fmt='currency'):
if fmt == 'currency':
format = "$,.1f"
elif fmt == 'percent':
format = ",.0%"
else:
format = ",.1f"
chart = alt.Chart(dataframe).mark_line(strokeWidth=1).encode(
x = alt.X(
'week',
axis=alt.Axis(
values=np.arange(0, (year*52) + 1, 13), # Explicitly specify quarter-end weeks
labelExpr="datum.value", # Show only these labels
title='Week'
)
),
y = alt.Y(
f'{y}:Q', # Q = a continuous real-valued quantity
title=y_axis_label,
axis=alt.Axis(format=format)
)
).properties(
width=650,
height=250,
title=title
).configure_view(
stroke=None
)
    return chart
weekly_plot(dataframe=alpha_weekly.select('week', 'aov').collect(),
y='aov',
title='Alpha - Weekly Average Order Value',
y_axis_label='Average Order Value ($)',
fmt='currency')
weekly_plot(dataframe=alpha_weekly.select('week', 'aovol').collect(),
y='aovol',
title='Alpha - Weekly Average Order Volume (kg)',
y_axis_label='Average Order Volume (kg)',
fmt='number')
weekly_plot(dataframe=alpha_weekly.select('week', 'avg_price_kg').collect(),
y='avg_price_kg',
title='Alpha - Weekly Average Price/kg',
y_axis_label='Average $/kg ',
fmt='currency')
Looking at the weekly average price per kg plot, there was much less variability in price/kg in weeks 1–52 compared to weeks 53–104. It would appear that there was some change in promotion policy between years 1 and 2. We do not have the data to explore this further, however.
Correlations between Weekly Revenue & its Components - Year 1
corr_matrix = alpha_weekly.filter(pl.col('week') <= 52).select(pl.all().exclude('week')).collect().corr()
relabel = pl.Series(['Revenue', 'Penetration', 'AOV', 'AOVOL', 'Price/Unit'])
corr_matrix = corr_matrix.with_columns(pl.Series(relabel).alias("Decomposition"))
(
GT(corr_matrix, rowname_col='Decomposition')
.tab_header(title='Correlation Matrix of Revenue and the Components of its Decomposition',
subtitle='Correlations between weekly revenue and the components of its (multiplicative) decomposition')
.cols_label(
spend=relabel[0],
penet=relabel[1],
aov=relabel[2],
aovol=relabel[3],
avg_price_kg=relabel[4]
)
.cols_align(align='center')
.fmt_number(decimals=3)
.data_color(
columns=['spend', 'penet', 'aov', 'aovol', 'avg_price_kg'],
domain=[-1, 1],
palette=['orange', 'white', 'rebeccapurple'],
)
.tab_style(
style=[
style.text(color="white"),
style.fill('lightgrey'),
],
locations=[
loc.body(0,0),
loc.body(1,1),
loc.body(2,2),
loc.body(3,3),
loc.body(4,4)
]
)
)| Correlation Matrix of Revenue and the Components of its Decomposition | |||||
|---|---|---|---|---|---|
| Correlations between weekly revenue and the components of its (multiplicative) decomposition | |||||
| Revenue | Penetration | AOV | AOVOL | Price/Unit | |
| Revenue | 1.000 | 0.990 | 0.788 | 0.840 | −0.534 |
| Penetration | 0.990 | 1.000 | 0.707 | 0.778 | −0.553 |
| AOV | 0.788 | 0.707 | 1.000 | 0.956 | −0.349 |
| AOVOL | 0.840 | 0.778 | 0.956 | 1.000 | −0.606 |
| Price/Unit | −0.534 | −0.553 | −0.349 | −0.606 | 1.000 |
Correlations between Weekly Revenue & its Components - Year 2
corr_matrix = alpha_weekly.filter(pl.col('week') >= 53).select(pl.all().exclude('week')).collect().corr()
relabel = pl.Series(['Revenue', 'Penetration', 'AOV', 'AOVOL', 'Price/Unit'])
corr_matrix = corr_matrix.with_columns(pl.Series(relabel).alias("Decomposition"))
(
GT(corr_matrix, rowname_col='Decomposition')
.tab_header(title='Correlation Matrix of Revenue and the Components of its Decomposition',
subtitle='Correlations between weekly revenue and the components of its (multiplicative) decomposition')
.cols_label(
spend=relabel[0],
penet=relabel[1],
aov=relabel[2],
aovol=relabel[3],
avg_price_kg=relabel[4]
)
.cols_align(align='center')
.fmt_number(decimals=3)
.data_color(
columns=['spend', 'penet', 'aov', 'aovol', 'avg_price_kg'],
domain=[-1, 1],
palette=['orange', 'white', 'rebeccapurple'],
)
.tab_style(
style=[
style.text(color="white"),
style.fill('lightgrey'),
],
locations=[
loc.body(0,0),
loc.body(1,1),
loc.body(2,2),
loc.body(3,3),
loc.body(4,4)
]
)
)| Correlation Matrix of Revenue and the Components of its Decomposition | |||||
|---|---|---|---|---|---|
| Correlations between weekly revenue and the components of its (multiplicative) decomposition | |||||
| Revenue | Penetration | AOV | AOVOL | Price/Unit | |
| Revenue | 1.000 | 0.978 | 0.414 | 0.770 | −0.685 |
| Penetration | 0.978 | 1.000 | 0.227 | 0.710 | −0.748 |
| AOV | 0.414 | 0.227 | 1.000 | 0.602 | −0.035 |
| AOVOL | 0.770 | 0.710 | 0.602 | 1.000 | −0.804 |
| Price/Unit | −0.685 | −0.748 | −0.035 | −0.804 | 1.000 |
Annual Revenue Decomposition
This same logic can be used to analyse change across longer time periods. For example, we note from our analysis earlier that Alpha’s revenue increased from $33,571 in year 1 to $35,251 in year 2, a 5% increase. As we are looking at a static sample (i.e., the same group of panellists over the two years), the increase cannot be due to population growth. Are we observing this growth because more households are buying the product (increase in buyers), and/or because those that are buying it are buying it more often (increase in frequency), and/or because the average spend per transaction has increased (increase in AOV)?
In order to explore this, we will use the following (multiplicative) sales decomposition:
\[ \begin{aligned} \text{Annual Revenue} ={}& \text{# households (HHs) in the panel} \\ &\times \text{proportion of HHs buying the brand (penetration)} \\ &\times \text{# purchase occasions per buyer (PPB)} \\ &\times \text{average order volume} \\ &\times \text{average price per unit of volume} \end{aligned} \]
In order to perform a similar decomposition of annual revenue, we first need to create a dataset that summarises, for each year, (1) the number of panellists that made at least one purchase of Alpha, (2) the total number of category purchase occasions on which Alpha was purchased, and (3) Alpha’s (dollar and volume) sales. We use the same logic as above, aggregating by year as opposed to week.
alpha_yearly_summary = (
grocery_lf
.filter(pl.col('brand') == 'Alpha')
.join(other=sku_lf, on='sku_id')
.with_columns(
(pl.col('units') * pl.col('price')).cast(pl.Float64).alias('spend'),
(pl.col('units') * pl.col('weight')/1000).alias('volume'),
(pl.col('week') / 52).ceil().cast(pl.UInt16).alias('year')
).group_by('year')
.agg(
# Total number of category purchase occasions each year
pl.col('trans_id').n_unique().alias('num_trans'),
# The number of panellists that made at least one transaction each year
pl.col('panel_id').n_unique().alias('num_buyers'),
# Total revenue generated each year
pl.col('spend').sum(),
# Total volume purchased each year
pl.col('volume').sum()
)
)
alpha_yearly = (
alpha_yearly_summary
.with_columns(
# Yearly penetration = num_buyer / panel_sizes
(pl.col('num_buyers') / panel_size).alias('penet'),
# PPB = num_trans / num_buyer
(pl.col('num_trans') / pl.col('num_buyers')).alias('ppb'),
# Average Order Value = spend / num_trans
(pl.col('spend') / pl.col('num_trans')).alias('aov'),
        # Average Order Volume = volume / num_trans
        (pl.col('volume') / pl.col('num_trans')).alias('aovol'),
# Average price per unit volume = spend / volume
(pl.col('spend') / pl.col('volume')).alias('avg_price_kg')
).with_columns(pl.col('year').cast(pl.String))
.collect()
)
# Transpose to cross-tabulate summary for presentation
alpha_yearly_T = alpha_yearly.transpose(
include_header=True, header_name='Summary', column_names='year'
)
# Compute percentage change for each variable from year 1 to year 2
alpha_yearly_pct_change = (
alpha_yearly
.with_columns(pl.col('*').exclude('year').pct_change())
.transpose(include_header=True, header_name='Summary', column_names='year')
.select('Summary' , pl.col('2').alias('pct_change'))
)
# Join summary variables and percentage change cross-tabular data for presentation
alpha_yearly_T = alpha_yearly_T.join(alpha_yearly_pct_change, on='Summary')relabel = pl.Series([
'# of Trans.',
'# of Buyers',
'Revenue',
'Volume',
'Penetration',
'PPB',
'AOV',
'AOVOL',
'Avg. Price/Unit'])
group = pl.Series(['Sales Summary']*4 + ['Revenue Decomposition']*5)
alpha_yearly_T = alpha_yearly_T.with_columns(
pl.Series(relabel).alias('Summary'),
pl.Series(group).alias('Group')
)
(
GT(alpha_yearly_T, rowname_col='Summary', groupname_col='Group')
.tab_header(title='Alpha - Annual Summary',
subtitle='Transactions, Buyers, Volume, Revenue & Revenue Decomposition')
.cols_label({
'1': 'Year 1',
'2': 'Year 2',
'pct_change': '% Change'})
.fmt_integer(rows=[0,1])
.fmt_number(rows=[2,3,5,7])
.fmt_currency(rows=[6,8])
.fmt_percent(rows=4)
.fmt_percent(columns=[3])
.data_color(
columns=3,
domain=[-1,1],
palette=['orange', 'white', 'rebeccapurple']
)
)| Alpha - Annual Summary | |||
|---|---|---|---|
| Transactions, Buyers, Volume, Revenue & Revenue Decomposition | |||
| Year 1 | Year 2 | % Change | |
| Sales Summary | |||
| # of Trans. | 9,060 | 9,240 | 1.99% |
| # of Buyers | 2,624 | 2,759 | 5.14% |
| Revenue | 33,570.94 | 35,250.75 | 5.00% |
| Volume | 9,166.25 | 10,346.40 | 12.87% |
| Revenue Decomposition | |||
| Penetration | 52.26% | 54.95% | 5.14% |
| PPB | 3.45 | 3.35 | −3.00% |
| AOV | $3.71 | $3.82 | 2.96% |
| AOVOL | 1.01 | 1.12 | 10.68% |
| Avg. Price/Unit | $3.66 | $3.41 | −6.97% |
We note that the 5% increase in revenue from year 1 to year 2 is associated with a 5% increase in the number of households making at least one purchase of Alpha in the year. While PPB drops, average order value increases, with these two changes effectively cancelling each other out. (The product of these two quantities, \((PPB_{2} \times AOV_{2})-(PPB_{1} \times AOV_{1})\), changes by only a cent or two between the two years.) While the average price/kg drops from year 1 to year 2, this is more than compensated for by the increase in average order volume, resulting in a 3% increase in average order value between the two years (AOV = Average Order Volume × Average Price/Unit).
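We can verify this arithmetic directly from the figures in the annual summary table (the values below are transcribed from that table, so they carry its rounding):

```python
# Year-1 and year-2 figures from the annual summary table
panel_size = 5021
figures = {
    1: {'num_trans': 9060, 'num_buyers': 2624, 'revenue': 33570.94, 'volume': 9166.25},
    2: {'num_trans': 9240, 'num_buyers': 2759, 'revenue': 35250.75, 'volume': 10346.40},
}

spend_per_buyer = {}
for yr, f in figures.items():
    penet = f['num_buyers'] / panel_size
    ppb = f['num_trans'] / f['num_buyers']
    aovol = f['volume'] / f['num_trans']
    price_per_kg = f['revenue'] / f['volume']
    # The decomposition multiplies back to revenue (up to floating-point error)
    reconstructed = panel_size * penet * ppb * aovol * price_per_kg
    assert abs(reconstructed - f['revenue']) < 1e-6
    # PPB x AOV = spend per buyer
    spend_per_buyer[yr] = ppb * (f['revenue'] / f['num_trans'])

# The drop in PPB and the rise in AOV roughly cancel: spend per buyer
# moves by only a cent or two between the two years
delta = spend_per_buyer[2] - spend_per_buyer[1]
```

Note that PPB × AOV algebraically reduces to revenue per buyer, which is why the near-cancellation pins the revenue growth on the growth in the number of buyers.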
We have observed that the number of buyers has increased, yet the average number of transactions (PPB) has dropped. Does this mean the “new” buyers are light buyers? Or are the existing buyers buying less?
In order to dig deeper, we need to examine temporal variation in customer-level purchasing. A natural starting point is to examine the joint distribution of purchasing for two consecutive periods.
We first count, for each panellist, the number of transactions involving Alpha in each year (num_trans).
# Summarizes the number of times Alpha was purchased in years 1 and 2 by each panellist
alpha_yearly_trans = (
grocery_lf
.filter(pl.col('brand') == 'Alpha')
.with_columns((pl.col('week') / 52).ceil().cast(pl.UInt16).alias('year'))
# Aggregate to count unique 'trans_id' by 'panel_id' and 'year'
.group_by('panel_id','year')
.agg(pl.col('trans_id').n_unique().alias('num_trans'))
)
alpha_yearly_trans = (
# Reshape from long format to wide format
# Pivoting the dataframe based on 'year' to create a wide format.
alpha_yearly_trans.collect().pivot(
on='year', # Each unique year will create a new column
values='num_trans', # The column to aggregate
index='panel_id' # Rows will be indexed by 'panel_id'
)
# Rename the columns to 'Year 1' and 'Year 2'
.rename({'1': 'Year 1', '2': 'Year 2'})
# Replace any null values with 0
.fill_null(0)
)
# Create a basic joint distribution
joint_dist_trans = (
alpha_yearly_trans
.group_by('Year 1', 'Year 2')
.agg(pl.col('panel_id').count().alias('count'))
)
# Add in the number of panellists that made no purchase of Alpha in either year
# The processed dataframe so far is a subset of the main panel data containing only those panellists who bought Alpha at least once across the two years; there are 3,142 such households.
# The panel contains 5,021 panellists, so the households that made zero purchases of Alpha in both years 1 and 2 must be added back to the processed dataframe.
tmp = panel_size - joint_dist_trans.select((pl.col('count')).sum()).item(0,0)
zero_purch = pl.DataFrame({'Year 1': 0, 'Year 2': 0, 'count': tmp})
# Right censor the distribution at 10+
joint_dist_trans = (
joint_dist_trans.with_columns(
pl.when(pl.col('Year 1') > 9)
.then(10)
.otherwise(pl.col('Year 1'))
.alias('Year 1')
)
.with_columns(
pl.when(pl.col('Year 2') > 9)
.then(10)
.otherwise(pl.col('Year 2'))
.alias('Year 2')
)
.vstack(zero_purch.cast(pl.UInt32))
)
# Unpivot the right censor data to create a year 1 x year 2 distribution matrix of the count of panellists who made x_{1} in year 1 & x_{2} in year 2
joint_dist_trans_pivot = (
joint_dist_trans
.sort('Year 2')
.pivot(
index='Year 1',
on='Year 2',
values='count',
aggregate_function='sum'
)
.rename({'Year 1': 'Year'})
.sort('Year')
.fill_null(0)
)
# Compute the row percentages, giving us the conditional distributions of transaction counts (the % of panellists from year 1 and their associated transaction counts in year 2)
# The conditional probability here is the empirical probability of making x_{2} transactions in year 2 given the panellist made x_{1} transactions in year 1
year1_trans_total = joint_dist_trans_pivot.select(
pl.col('*').exclude('Year')
).sum_horizontal()
pct_year1_total = joint_dist_trans_pivot.with_columns(
(pl.col('*').exclude('Year') / year1_trans_total)
)
# Summarizes Alpha transactions by year and panelist
alpha_yearly_trans = (
grocery_lf
.filter(pl.col('brand') == 'Alpha')
.with_columns((pl.col('week') / 52).ceil().cast(pl.UInt16).alias('year'))
.group_by(['panel_id', 'year'])
.agg(pl.col('trans_id').n_unique().alias('num_trans'))
.collect()
.pivot(
values='num_trans',
index='panel_id',
on='year'
)
.rename({'1': 'Year 1 Trans', '2': 'Year 2 Trans'})
.fill_null(0)
)
# Create joint distribution
joint_dist_trans = (
alpha_yearly_trans
.group_by(['Year 1 Trans', 'Year 2 Trans'])
.agg(pl.len().alias('count'))
)
# Add zero-purchase panellists
zero_purchasers = panel_size - joint_dist_trans['count'].sum()
joint_dist_trans = joint_dist_trans.vstack(
pl.DataFrame({'Year 1 Trans': [0], 'Year 2 Trans': [0], 'count': [zero_purchasers]}).cast(pl.UInt32)
)
# Right-censor transactions at 10+
joint_dist_trans = joint_dist_trans.with_columns([
pl.when(pl.col('Year 1 Trans') > 9).then(10).otherwise(pl.col('Year 1 Trans')).alias('Year 1 Trans'),
pl.when(pl.col('Year 2 Trans') > 9).then(10).otherwise(pl.col('Year 2 Trans')).alias('Year 2 Trans')
])
# Create the year 1 x year 2 distribution matrix
joint_dist_trans_pivot = (
joint_dist_trans
.sort('Year 2 Trans')
.pivot(
values='count',
index='Year 1 Trans',
on='Year 2 Trans',
aggregate_function='sum'
)
.rename({'Year 1 Trans': '# Trans'})
.sort('# Trans')
.fill_null(0)
)
# Compute conditional distribution (row percentages)
row_totals = joint_dist_trans_pivot.select(pl.col('*').exclude('# Trans')).sum_horizontal()
pct_joint_dist = joint_dist_trans_pivot.with_columns(
(pl.col('*').exclude('# Trans') / row_totals)
)
# Display final results
print("Alpha Yearly Transactions:")
display(alpha_yearly_trans)
print("Joint Distribution:")
display(joint_dist_trans)
print("Joint Distribution - Year 1 x Year 2 Distribution Matrix:")
display(joint_dist_trans_pivot)
print("Conditional Distribution (% of Year 1 Transactions):")
display(pct_joint_dist)
Alpha Yearly Transactions:
| panel_id | Year 2 Trans | Year 1 Trans |
|---|---|---|
| u32 | u32 | u32 |
| 3116650 | 1 | 2 |
| 3122796 | 1 | 3 |
| 3106933 | 0 | 1 |
| 3120325 | 5 | 5 |
| 3105207 | 5 | 9 |
| … | … | … |
| 3116871 | 1 | 0 |
| 3109117 | 1 | 0 |
| 3120851 | 2 | 0 |
| 3125858 | 0 | 1 |
| 3112986 | 1 | 0 |
Joint Distribution:
| Year 1 Trans | Year 2 Trans | count |
|---|---|---|
| u32 | u32 | u32 |
| 8 | 2 | 3 |
| 10 | 10 | 1 |
| 10 | 10 | 1 |
| 9 | 0 | 1 |
| 3 | 6 | 21 |
| … | … | … |
| 10 | 10 | 2 |
| 8 | 10 | 1 |
| 3 | 3 | 83 |
| 10 | 10 | 1 |
| 0 | 0 | 1879 |
Joint Distribution - Year 1 x Year 2 Distribution Matrix:
| # Trans | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 |
| 0 | 1879 | 342 | 105 | 39 | 18 | 9 | 4 | 1 | 0 | 0 | 0 |
| 1 | 259 | 201 | 128 | 79 | 40 | 14 | 6 | 3 | 2 | 0 | 1 |
| 2 | 83 | 120 | 108 | 80 | 75 | 27 | 11 | 9 | 3 | 1 | 0 |
| 3 | 25 | 60 | 78 | 83 | 65 | 54 | 21 | 9 | 3 | 1 | 1 |
| 4 | 8 | 28 | 62 | 45 | 54 | 34 | 26 | 8 | 3 | 6 | 3 |
| … | … | … | … | … | … | … | … | … | … | … | … |
| 6 | 1 | 6 | 15 | 17 | 24 | 31 | 20 | 13 | 11 | 3 | 3 |
| 7 | 0 | 2 | 7 | 5 | 15 | 15 | 16 | 14 | 8 | 4 | 15 |
| 8 | 1 | 1 | 3 | 5 | 9 | 9 | 16 | 6 | 10 | 8 | 3 |
| 9 | 1 | 0 | 3 | 4 | 4 | 4 | 7 | 8 | 4 | 5 | 7 |
| 10 | 0 | 1 | 3 | 0 | 3 | 8 | 12 | 14 | 12 | 5 | 49 |
Conditional Distribution (% of Year 1 Transactions):
| # Trans | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| u32 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
| 0 | 0.783897 | 0.142678 | 0.043805 | 0.01627 | 0.007509 | 0.003755 | 0.001669 | 0.000417 | 0.0 | 0.0 | 0.0 |
| 1 | 0.353342 | 0.274216 | 0.174625 | 0.107776 | 0.05457 | 0.0191 | 0.008186 | 0.004093 | 0.002729 | 0.0 | 0.001364 |
| 2 | 0.160542 | 0.232108 | 0.208897 | 0.154739 | 0.145068 | 0.052224 | 0.021277 | 0.017408 | 0.005803 | 0.001934 | 0.0 |
| 3 | 0.0625 | 0.15 | 0.195 | 0.2075 | 0.1625 | 0.135 | 0.0525 | 0.0225 | 0.0075 | 0.0025 | 0.0025 |
| 4 | 0.028881 | 0.101083 | 0.223827 | 0.162455 | 0.194946 | 0.122744 | 0.093863 | 0.028881 | 0.01083 | 0.021661 | 0.01083 |
| … | … | … | … | … | … | … | … | … | … | … | … |
| 6 | 0.006944 | 0.041667 | 0.104167 | 0.118056 | 0.166667 | 0.215278 | 0.138889 | 0.090278 | 0.076389 | 0.020833 | 0.020833 |
| 7 | 0.0 | 0.019802 | 0.069307 | 0.049505 | 0.148515 | 0.148515 | 0.158416 | 0.138614 | 0.079208 | 0.039604 | 0.148515 |
| 8 | 0.014085 | 0.014085 | 0.042254 | 0.070423 | 0.126761 | 0.126761 | 0.225352 | 0.084507 | 0.140845 | 0.112676 | 0.042254 |
| 9 | 0.021277 | 0.0 | 0.06383 | 0.085106 | 0.085106 | 0.085106 | 0.148936 | 0.170213 | 0.085106 | 0.106383 | 0.148936 |
| 10 | 0.0 | 0.009346 | 0.028037 | 0.0 | 0.028037 | 0.074766 | 0.11215 | 0.130841 | 0.11215 | 0.046729 | 0.457944 |
relabel = pl.Series([f'{i}' if i < 10 else '10+' for i in range(11)])
joint_dist_trans_pivot = joint_dist_trans_pivot.select(pl.col('*').exclude('# Trans')).with_columns(relabel.alias('# Trans'))
(
GT(joint_dist_trans_pivot, rowname_col='# Trans')
.tab_header(title='Joint Distribution of the Purchasing of Alpha in Year 1 & 2')
.tab_spanner(label='# of Transactions in Year 2', columns=joint_dist_trans_pivot.columns[:-1])
.tab_stubhead(label='Y1 / Y2')
.cols_label({'10': '10+'})
.cols_align(align='center')
.fmt_integer()
.data_color(
columns=[str(i) for i in range(10)],
domain=[0, 1900],
palette=['white', 'rebeccapurple'],
)
.tab_source_note(md("""**Read Row/Column as**: _# of Transactions in Year 1 / # of Transactions in Year 2_"""))
)
| Joint Distribution of the Purchasing of Alpha in Year 1 & 2 | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Y1 / Y2 | # of Transactions in Year 2 | ||||||||||
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10+ | |
| 0 | 1,879 | 342 | 105 | 39 | 18 | 9 | 4 | 1 | 0 | 0 | 0 |
| 1 | 259 | 201 | 128 | 79 | 40 | 14 | 6 | 3 | 2 | 0 | 1 |
| 2 | 83 | 120 | 108 | 80 | 75 | 27 | 11 | 9 | 3 | 1 | 0 |
| 3 | 25 | 60 | 78 | 83 | 65 | 54 | 21 | 9 | 3 | 1 | 1 |
| 4 | 8 | 28 | 62 | 45 | 54 | 34 | 26 | 8 | 3 | 6 | 3 |
| 5 | 5 | 13 | 28 | 31 | 49 | 46 | 23 | 20 | 5 | 3 | 4 |
| 6 | 1 | 6 | 15 | 17 | 24 | 31 | 20 | 13 | 11 | 3 | 3 |
| 7 | 0 | 2 | 7 | 5 | 15 | 15 | 16 | 14 | 8 | 4 | 15 |
| 8 | 1 | 1 | 3 | 5 | 9 | 9 | 16 | 6 | 10 | 8 | 3 |
| 9 | 1 | 0 | 3 | 4 | 4 | 4 | 7 | 8 | 4 | 5 | 7 |
| 10+ | 0 | 1 | 3 | 0 | 3 | 8 | 12 | 14 | 12 | 5 | 49 |
| Read Row/Column as: # of Transactions in Year 1 / # of Transactions in Year 2 | |||||||||||
How do we read this table? The second row (those making one purchase of Alpha in year 1) tells us how many of these people bought Alpha 0, 1, 2, . . . times in year 2. For example, 259 of these households didn’t buy Alpha in year 2, 201 bought Alpha once in year 2, and so on.
relabel = pl.Series([f'{i}' if i < 10 else '10+' for i in range(11)])
pct_joint_dist = pct_joint_dist.select(pl.col('*').exclude('# Trans')).with_columns(relabel.alias('# Trans'))
(
GT(pct_joint_dist, rowname_col='# Trans')
.tab_header(title='Conditional Distribution of Transaction Counts',
subtitle=md('Empirical probability of making year 2 transaction given year 1 transaction'))
.tab_spanner(label='# of Transactions in Year 2', columns=pct_joint_dist.columns[:-1])
.tab_stubhead(label='Y1 / Y2')
.cols_label({'10': '10+'})
.cols_align(align='center')
.fmt_percent(decimals=0)
.data_color(
columns=[str(i) for i in range(11)],
domain=[0, 1],
palette=['white', 'rebeccapurple'],
)
.tab_source_note(md("""**Read Row/Column as**: _# of Transactions in Year 1 / # of Transactions in Year 2_"""))
)
| Conditional Distribution of Transaction Counts | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Empirical probability of making year 2 transaction given year 1 transaction | |||||||||||
| Y1 / Y2 | # of Transactions in Year 2 | ||||||||||
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10+ | |
| 0 | 78% | 14% | 4% | 2% | 1% | 0% | 0% | 0% | 0% | 0% | 0% |
| 1 | 35% | 27% | 17% | 11% | 5% | 2% | 1% | 0% | 0% | 0% | 0% |
| 2 | 16% | 23% | 21% | 15% | 15% | 5% | 2% | 2% | 1% | 0% | 0% |
| 3 | 6% | 15% | 20% | 21% | 16% | 14% | 5% | 2% | 1% | 0% | 0% |
| 4 | 3% | 10% | 22% | 16% | 19% | 12% | 9% | 3% | 1% | 2% | 1% |
| 5 | 2% | 6% | 12% | 14% | 22% | 20% | 10% | 9% | 2% | 1% | 2% |
| 6 | 1% | 4% | 10% | 12% | 17% | 22% | 14% | 9% | 8% | 2% | 2% |
| 7 | 0% | 2% | 7% | 5% | 15% | 15% | 16% | 14% | 8% | 4% | 15% |
| 8 | 1% | 1% | 4% | 7% | 13% | 13% | 23% | 8% | 14% | 11% | 4% |
| 9 | 2% | 0% | 6% | 9% | 9% | 9% | 15% | 17% | 9% | 11% | 15% |
| 10+ | 0% | 1% | 3% | 0% | 3% | 7% | 11% | 13% | 11% | 5% | 46% |
| Read Row/Column as: # of Transactions in Year 1 / # of Transactions in Year 2 | |||||||||||
How do we interpret this table? Looking at Year 1, Row 2, we see that 35.3% of the panellists that made one purchase of Alpha in year 1 made no purchases of Alpha in year 2, 27.4% purchased Alpha once, 17.5% purchased Alpha twice, and so on.
# Compute marginal transaction distribution for each year: count of # of panellists who made x_{t} transactions as a % of total panellists
def marginal_dist():
years = ['Year 1 Trans', 'Year 2 Trans']
dfs = []
for year in years:
y_dist = (
joint_dist_trans
.group_by(year)
.agg((pl.col('count').sum() / panel_size).alias(f'% of Total - {year}'))
.rename({year: 'num_trans'})
.sort('num_trans')
)
dfs.append(y_dist)
marginal_dist = dfs[0].hstack(dfs[1].drop('num_trans'))
return marginal_dist
marginal_dist()
| num_trans | % of Total - Year 1 Trans | % of Total - Year 2 Trans |
|---|---|---|
| u32 | f64 | f64 |
| 0 | 0.477395 | 0.450508 |
| 1 | 0.145987 | 0.154153 |
| 2 | 0.102968 | 0.107548 |
| 3 | 0.079665 | 0.077275 |
| 4 | 0.055168 | 0.070902 |
| … | … | … |
| 6 | 0.02868 | 0.032264 |
| 7 | 0.020116 | 0.020912 |
| 8 | 0.014141 | 0.012149 |
| 9 | 0.009361 | 0.00717 |
| 10 | 0.02131 | 0.017128 |
# marginal distribution for each year and the associated clustered bar chart
# Step 1a: Compute marginal transaction distribution for each year: count of # of panellists who made x_{t} transactions as a % of total panellists
# Step 1b: Convert Wide-form data to long-form data
def marginal_dist():
years = ['Year 1 Trans', 'Year 2 Trans']
dfs = []
for year in years:
y_dist = (
joint_dist_trans
.group_by(year)
.agg((pl.col('count').sum() / panel_size).alias('% of Total')) # marginal distribution of purchasing of Alpha in the first & second year.
.with_columns(pl.lit(year).alias('Year'))
.rename({year: 'num_trans'})
)
dfs.append(y_dist)
marginal_dist = dfs[0].vstack(dfs[1])
return marginal_dist
# Step 2: Plot long-form data as grouped bar chart
alt.Chart(marginal_dist()).mark_bar().encode(
x=alt.X("num_trans:O", axis=alt.Axis(labelAngle=0, title='Number of Transactions')),
xOffset="Year:N",
y=alt.Y("% of Total:Q", axis=alt.Axis(format=".0%", title='% of Households')),
color='Year:N'
).properties(
width=650,
height=250,
title={"text": 'Distribution of Alpha Transactions', "subtitle": 'Marginal Distribution'},
)
Let us make two immediate observations:
- The marginal distribution of purchasing in the first year is reasonably similar to that for the second year. Some differences that stand out are the smaller percentage of households making zero purchases in year 2, which corresponds to the higher penetration, and the smaller percentage of households buying Alpha ten or more times in year 2.
- When first seeing a table that shows the distribution of year 2 purchasing broken down by the level of year 1 purchasing, many expect there to be a strong diagonal in the table and are alarmed by the fact that this is rarely the case. It is important to realize that buying behaviour is not deterministic. From the perspective of the analyst, it can be viewed as-if random, bouncing around each person’s underlying propensity to buy the product. Someone who makes one purchase in year 1 is possibly a light buyer, so is it really surprising that 35.3% did not buy the product again in the second year? It does not mean that they are “lost”; most of them will buy Alpha again sometime the following year (year 3). Similarly, are those who did not buy the product in period 1 but did in period 2 new customers? Some probably are. But, for any established product category, most are probably people who have purchased the product in previous years and who, for whatever reason, did not buy it that year. The key to analysing such tables is to compare them to a benchmark that assumes stable underlying purchasing patterns.
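One simple benchmark of this kind (a sketch on our part, not a method from the original text) treats year 1 and year 2 purchasing as independent draws from stable marginals: the expected count in cell \((x_{1}, x_{2})\) is \(N \times P(x_{1}) \times P(x_{2})\), i.e. the outer product of the two marginal distributions scaled by the panel size. Cells where the observed counts sit well above this benchmark indicate genuine persistence in buying levels.

```python
import numpy as np

# Toy joint-count matrix (hypothetical): rows = year 1, columns = year 2
observed = np.array([[50., 20.],
                     [20., 10.]])
n = observed.sum()
p_year1 = observed.sum(axis=1) / n  # marginal distribution of year 1 counts
p_year2 = observed.sum(axis=0) / n  # marginal distribution of year 2 counts

# Expected counts if the two years were independent
expected = n * np.outer(p_year1, p_year2)
print(expected.round(1))  # ≈ [[49, 21], [21, 9]]
```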
Let’s explore how to create the joint distribution of category spend in years 1 and 2. The logic follows that of the binning of spend used to create the distribution of category spend in year 1 in the earlier section.
It is important to note that the initial aggregation by panel_id and year in the long-form DataFrame does not include entries for panellists who made no category purchases in either year 1 or year 2. However, we want to account for panellists who made a category purchase in one year but not the other, as well as those who did not make a category purchase in either year. When the data is pivoted, some spend range combinations for year 1 and year 2 contain null values, indicating that those panellists made a purchase in one year but not the other. Additionally, the DataFrame is missing panellists who made no purchases in both year 1 and year 2. To address the first issue, we replace the null values with float('-inf') (negative infinity) in place of the (-inf, 0] bin. To resolve the second issue, we calculate the total number of panellists in the DataFrame, subtract the count of existing panellists from the total panel_size, and add this difference back to the DataFrame.
# Summarizes Category Spend by year and panelist
joint_dist_spend = (
grocery_lf
.with_columns(
(pl.col('week') / 52).ceil().cast(pl.UInt16).alias('year'),
(pl.col('units') * pl.col('price')).cast(pl.Float32).alias('spend')
)
.group_by(['panel_id', 'year'])
.agg(pl.col('spend').sum())
.with_columns(pl.col('spend').cut(breaks=np.arange(0, 51, 5), include_breaks=True).alias('cut'))
.with_columns(pl.col('cut')).unnest('cut')
.collect()
.pivot(
values='breakpoint',
index='panel_id',
on='year'
)
.fill_null(float('-inf'))
# Create joint distribution
.rename({'1': 'Year 1', '2': 'Year 2'})
.group_by(['Year 1', 'Year 2'])
.agg(pl.len().alias('count'))
)
# Add zero-spend panellists
# The joint distribution of spend so far is for those panellists that made at least one category purchase across the two years.
# We want to also include those panellists that didn’t make a category purchase
zero_purchasers = pl.Series([panel_size - joint_dist_spend['count'].sum()]).cast(pl.UInt32)
joint_dist_spend = joint_dist_spend.vstack(
pl.DataFrame({'Year 1': [float('-inf')], 'Year 2': [float('-inf')], 'count': zero_purchasers})
)
# Create the year 1 x year 2 distribution matrix
joint_dist_spend_pivot = (
joint_dist_spend
.sort('Year 2')
.pivot(
values='count',
index='Year 1',
on='Year 2',
aggregate_function='sum'
)
.rename({'Year 1': 'Spend'})
.sort('Spend')
.fill_null(0)
)
print('Joint Distribution:')
display(joint_dist_spend)
print('Joint Distribution - Year 1 x Year 2 Distribution Matrix:')
display(joint_dist_spend_pivot)
Joint Distribution:
| Year 1 | Year 2 | count |
|---|---|---|
| f64 | f64 | u32 |
| 10.0 | 5.0 | 217 |
| 50.0 | 45.0 | 6 |
| 45.0 | inf | 5 |
| 15.0 | 30.0 | 24 |
| 15.0 | -inf | 28 |
| … | … | … |
| 50.0 | 15.0 | 2 |
| 25.0 | inf | 3 |
| 10.0 | 15.0 | 196 |
| 45.0 | 40.0 | 6 |
| -inf | -inf | 183 |
Joint Distribution - Year 1 x Year 2 Distribution Matrix:
| Spend | -inf | 5.0 | 10.0 | 15.0 | 20.0 | 25.0 | 30.0 | 35.0 | 40.0 | 45.0 | 50.0 | inf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| f64 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 |
| -inf | 183 | 125 | 88 | 38 | 9 | 2 | 1 | 0 | 1 | 0 | 0 | 0 |
| 5.0 | 129 | 260 | 211 | 103 | 30 | 5 | 9 | 1 | 1 | 0 | 0 | 1 |
| 10.0 | 87 | 217 | 379 | 196 | 113 | 52 | 17 | 7 | 2 | 0 | 0 | 3 |
| 15.0 | 28 | 107 | 248 | 223 | 140 | 78 | 24 | 15 | 6 | 4 | 7 | 2 |
| 20.0 | 12 | 35 | 102 | 164 | 138 | 86 | 43 | 19 | 13 | 5 | 2 | 4 |
| … | … | … | … | … | … | … | … | … | … | … | … | … |
| 35.0 | 0 | 3 | 10 | 18 | 28 | 38 | 17 | 32 | 23 | 11 | 9 | 4 |
| 40.0 | 0 | 0 | 3 | 7 | 22 | 13 | 14 | 17 | 19 | 9 | 8 | 10 |
| 45.0 | 0 | 0 | 0 | 1 | 7 | 8 | 11 | 10 | 6 | 14 | 6 | 5 |
| 50.0 | 0 | 1 | 0 | 2 | 2 | 7 | 4 | 6 | 3 | 6 | 5 | 12 |
| inf | 1 | 1 | 0 | 3 | 4 | 5 | 8 | 13 | 9 | 7 | 11 | 47 |
relabel = pl.Series([
'-∞ - 0' if lb < 0 else f'{lb} - {lb + 5}' if lb != 50 else f'{lb} - ∞'
for lb in range(-5, 55, 5)
])
joint_dist_spend_pivot = joint_dist_spend_pivot.select(pl.col('*').exclude('Spend')).with_columns(relabel.alias('Spend'))
(
GT(joint_dist_spend_pivot, rowname_col='Spend')
.tab_header(title='Joint Distribution of Category Spend in Year 1 & 2')
.tab_spanner(label='Range of Spend in Year 2', columns=joint_dist_spend_pivot.columns[:-1])
.tab_stubhead(label='Year')
.cols_label({joint_dist_spend_pivot.columns[:-1][i]: relabel[i] for i in range(len(relabel))})
.cols_align(align='center')
.fmt_integer()
.data_color(
columns=joint_dist_spend_pivot.columns[:-1],
domain=[0, 400],
palette=['white', 'rebeccapurple'],
)
.tab_source_note(md("""**Read Row/Column as**: _Range of Spend in Year 1 / Range of Spend in Year 2_"""))
)
| Joint Distribution of Category Spend in Year 1 & 2 | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Year | Range of Spend in Year 2 | |||||||||||
| -∞ - 0 | 0 - 5 | 5 - 10 | 10 - 15 | 15 - 20 | 20 - 25 | 25 - 30 | 30 - 35 | 35 - 40 | 40 - 45 | 45 - 50 | 50 - ∞ | |
| -∞ - 0 | 183 | 125 | 88 | 38 | 9 | 2 | 1 | 0 | 1 | 0 | 0 | 0 |
| 0 - 5 | 129 | 260 | 211 | 103 | 30 | 5 | 9 | 1 | 1 | 0 | 0 | 1 |
| 5 - 10 | 87 | 217 | 379 | 196 | 113 | 52 | 17 | 7 | 2 | 0 | 0 | 3 |
| 10 - 15 | 28 | 107 | 248 | 223 | 140 | 78 | 24 | 15 | 6 | 4 | 7 | 2 |
| 15 - 20 | 12 | 35 | 102 | 164 | 138 | 86 | 43 | 19 | 13 | 5 | 2 | 4 |
| 20 - 25 | 4 | 19 | 52 | 82 | 90 | 81 | 63 | 25 | 17 | 9 | 5 | 3 |
| 25 - 30 | 0 | 4 | 21 | 29 | 54 | 47 | 45 | 22 | 21 | 6 | 2 | 5 |
| 30 - 35 | 0 | 3 | 10 | 18 | 28 | 38 | 17 | 32 | 23 | 11 | 9 | 4 |
| 35 - 40 | 0 | 0 | 3 | 7 | 22 | 13 | 14 | 17 | 19 | 9 | 8 | 10 |
| 40 - 45 | 0 | 0 | 0 | 1 | 7 | 8 | 11 | 10 | 6 | 14 | 6 | 5 |
| 45 - 50 | 0 | 1 | 0 | 2 | 2 | 7 | 4 | 6 | 3 | 6 | 5 | 12 |
| 50 - ∞ | 1 | 1 | 0 | 3 | 4 | 5 | 8 | 13 | 9 | 7 | 11 | 47 |
| Read Row/Column as: Range of Spend in Year 1 / Range of Spend in Year 2 | ||||||||||||
One common summary measure is the repeat rate (or repeat-buying rate), which is defined as the percentage of the brand’s customers in a given period who also purchase the product in the following period. We now explore how to compute quarter-by-quarter repeat rates.
We will carry out the analysis and plot the results for Alpha.
The first thing we do is create a quarterly incidence matrix that indicates whether or not each panellist purchased Alpha each quarter, i.e. a table that indicates whether or not each panellist made at least one purchase of Alpha in each of the eight quarters in our dataset (i.e., an ever-buyers (by quarter) table for Alpha).
(Rows represent panel_id and columns represent quarter; 1 for yes, 0 for no.)
Recall that the repeat rate is the percentage of a brand’s customers in a given period who also purchase the product in the following period. We compute the repeat-buying rate (RBR) using the function:
\[ \text{RBR}(q \to q+1) = \frac{\sum_{i=1}^{N} Q_{q,i} \cdot Q_{q+1,i}}{\sum_{i=1}^{N} Q_{q,i}} \]
where \(Q_{q,i}\) represents the value of quarter \(q\) for panellist \(i\). It is \(1\) if the panellist purchased in quarter \(q\), and \(0\) otherwise.
\(N\) represents the total number of panellists.
\(Q_{q,i} \cdot Q_{q+1,i}\) is an element-wise product for panellist \(i\). It is \(1\) if the panellist purchased in both quarters \(q\) and \(q+1\), and \(0\) otherwise.
\(\sum_{i=1}^{N} Q_{q,i} \cdot Q_{q+1,i}\) counts the number of panellists who purchased at least once in both quarter \(q\) and quarter \(q+1\).
\(\sum_{i=1}^{N} Q_{q,i}\) counts the total number of panellists (households) who purchased in quarter \(q\).
\(\text{RBR}(q \to q+1)\) is therefore the proportion of buyers in quarter \(q\) who purchased again in quarter \(q+1\).
The element-wise product ensures the calculation is based on individual panellists’ purchase patterns: it leverages the binary nature of \(Q\), being \(1\) only when both values are \(1\). The formula then aggregates over all panellists (\(i\)) to compute both the numerator and denominator.
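The formula can be worked through on a tiny hypothetical incidence matrix before applying it to the real data:

```python
import numpy as np

# Toy incidence matrix: rows = 3 panellists, columns = quarters q and q+1
Q = np.array([[1, 1],
              [1, 0],
              [0, 1]])
numerator = (Q[:, 0] * Q[:, 1]).sum()  # bought in both quarters: 1 panellist
denominator = Q[:, 0].sum()            # bought in quarter q: 2 panellists
print(numerator / denominator)         # RBR = 0.5
```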
alpha_qtrly_incidence = (
grocery_lf
.filter(pl.col('brand') == 'Alpha')
.with_columns((np.floor((pl.col('week') - 1) / 13) + 1).cast(pl.UInt16).alias('quarter'))
.group_by('panel_id', 'quarter')
.agg(pl.lit(1).alias('incidence')) # Mark 1 for incidence
.collect()
.pivot(
values='incidence',
index='panel_id',
on='quarter',
aggregate_function='max' # Ensures presence is represented as 1
)
.fill_null(0) # Replace missing quarters with 0
)
# Calculate Repeat Buying Rate (RBR)
rbr = []
num_quarters = alpha_qtrly_incidence.shape[1] - 1 # Exclude 'panel_id' column
for q in range(1, num_quarters): # Iterate through quarter pairs dynamically
prev_col = str(q)
next_col = str(q + 1)
# Compute RBR for each quarter pair
rbr_value = (
(alpha_qtrly_incidence[prev_col] * alpha_qtrly_incidence[next_col]).sum() /
alpha_qtrly_incidence[prev_col].sum()
)
rbr.append(rbr_value)
# Convert RBR values to a DataFrame for clarity
rbr = pl.DataFrame({"Quarter Pair": [f"{q}->{q+1}" for q in range(1, num_quarters)], "RBR": rbr})
display(alpha_qtrly_incidence)
display(rbr)
| panel_id | 6 | 7 | 8 | 5 | 1 | 4 | 2 | 3 |
|---|---|---|---|---|---|---|---|---|
| u32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 |
| 3102846 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3105496 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 3108116 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3104250 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3112170 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| … | … | … | … | … | … | … | … | … |
| 3124638 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3122584 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3120507 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3103052 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3118007 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Quarter Pair | RBR |
|---|---|
| str | f64 |
| "1->2" | 0.656682 |
| "2->3" | 0.614922 |
| "3->4" | 0.549313 |
| "4->5" | 0.593472 |
| "5->6" | 0.691401 |
| "6->7" | 0.597488 |
| "7->8" | 0.525498 |
# Prepare the incidence matrix
alpha_qtrly_incidence = (
grocery_lf
.filter(pl.col('brand') == 'Alpha')
.with_columns((np.floor((pl.col('week') - 1) / 13) + 1).cast(pl.UInt16).alias('quarter'))
.group_by('panel_id', 'quarter')
.agg(pl.lit(1).alias('incidence')) # Mark 1 for incidence
.collect()
.pivot(
values='incidence',
index='panel_id',
on='quarter',
aggregate_function='max' # Ensures presence is represented as 1
)
.fill_null(0) # Replace missing quarters with 0
)
# Get all quarters dynamically
quarters = alpha_qtrly_incidence.columns[1:] # Exclude 'panel_id'
# Compute RBR for all quarter pairs using vectorized operations
rbr = (
pl.DataFrame(
{
f"Q{q} -> Q{q+1}": [
(
(alpha_qtrly_incidence[str(q)] * alpha_qtrly_incidence[str(q + 1)]).sum() /
alpha_qtrly_incidence[str(q)].sum()
)
] # Wrap in a list for DataFrame creation
for q in range(1, len(quarters))
}
)
.transpose(include_header=True, header_name='Quarter Pair', column_names=['RBR']) # Transpose for desired layout & rename resulting columns
)
alt.Chart(rbr).mark_line(strokeWidth=1).encode(
x = alt.X('Quarter Pair:O', title='Quarter', axis=alt.Axis(labelAngle=0)),
y = alt.Y('RBR:Q', title='Repeat Rate', axis=alt.Axis(format=",.0%"), scale=alt.Scale(domain=[0, 0.8]))
).properties(
width=650, height=250,title='Plot of quarterly repeat rates for Alpha'
).configure_view(stroke=None)
We see that there is some variability in the repeat rate. Is it possibly in decline? One thing to realise is that the variability can be driven by the firm’s promotional activities. Aggressive promotions may attract a segment of consumers that only buy on promotion. If the firm promotes heavily in one quarter and has fewer promotions the following quarter, we would expect to see a drop in the repeat rate. We can explore this by looking at how Alpha’s promotional activity (as reflected in price/kg) varied across quarters.
alpha_qtrly_price_avg = (
grocery_lf
.filter(pl.col('brand') == 'Alpha')
.join(sku_lf, on='sku_id')
.with_columns(
(np.floor((pl.col('week') - 1) / 13) + 1).cast(pl.UInt16).alias('quarter'),
(((pl.col('units') * pl.col('weight'))/1000)).alias('volume'),
(pl.col('units') * pl.col('price')).cast(pl.Float32).alias('spend')
)
.with_columns((pl.col('spend') / pl.col('volume')).alias('price/kg'))
.select('quarter', 'price/kg')
.group_by('quarter')
.agg(pl.col('price/kg').mean().alias('avg_price_per_kg'))
.collect()
)
alt.Chart(alpha_qtrly_price_avg).mark_line(strokeWidth=1).encode(
x = alt.X('quarter:O', title='Quarter', axis=alt.Axis(labelAngle=0)),
y = alt.Y('avg_price_per_kg:Q', title='Price/kg', axis=alt.Axis(format="$,.2f"), scale=alt.Scale(domain=[3, 4.5]))
).properties(
width=650, height=250,title="Quarterly Average Price/kg - Alpha"
).configure_view(stroke=None)
Garbage Collect
exceptions = ['kiwi_lf', 'In', 'exceptions', 'active_variables']
active_variables = [
var for var, value in globals().items()
if not var.startswith('_') # Exclude variables that start with "_"
and var not in exceptions # Exclude variables in the exceptions list
and isinstance(value, (pl.LazyFrame, pl.DataFrame, pl.Series, alt.Chart, alt.LayerChart, list, int, float, str, np.ndarray, np.int64, np.float32)) # Remove these types only
]
for var in active_variables:
del globals()[var]
del active_variables, exceptions, var
gc.collect()
375
Refer to Depth-of-Repeat Sales Summary for more details
Central to diagnosing the performance of a new product is the decomposition of its total sales into trial and repeat sales. A given overall aggregate sales history could be the realization of very different purchasing scenarios. For example, a low sales level for a new product could be the result of (i) many consumers making a trial (i.e., first-ever) purchase but few of them making a repeat purchase (because the product does not meet their expectations), or (ii) a low trial rate but a high level of repeat purchasing amongst the triers (because the product meets a real need among a relatively small set of buyers). Without a trial/repeat sales decomposition, it is impossible to determine which of these scenarios best describes the sales data. (And of course it is impossible to make such decompositions without access to household-level purchasing data (i.e., consumer panel data).)
We conduct a trial/repeat sales decomposition below; the plot produced shows the week-by-week sales of Kiwi Bubbles (dataset 2) broken down into trial and repeat sales (i.e., first-ever purchases of the new product by a panellist vs. subsequent purchases of the new product by a panellist).
Another variant of the sales decomposition is conducted in the section decomposing cumulative sales into its trial, first repeat, and additional repeat components, in which we plot cumulative sales and separate repeat sales into first repeat (second-ever) purchases and additional repeat (second repeat (third-ever) purchases + third repeat (fourth-ever) purchases + . . . ). From this plot, we see that 45% of Kiwi Bubbles’ year 1 sales came from trial purchases and 38% were due to additional repeat purchases.
Given the data summary from which these two plots are created, we can create further plots that give us insight into buyer behaviour. For example, we can plot cumulative trial (growth in cumulative trial over time), which shows the percentage of households in the market that have made a trial purchase by any week in the new product’s first year on the market. (We see that just under 10% of households have tried the new product by year end. This still appears to be growing but at a far slower rate than earlier on in the year.)
People can be induced to try a new product by promotions, etc. but the key to success is repeat purchasing. The first step is making a first-repeat purchase. One useful metric to track is “percent triers repeating”—the percentage of panellists that made a trial purchase that have gone on to make a (first) repeat purchase. The associated plot for Kiwi Bubbles is constructed in the section Evolution of the percentage of triers that have made a repeat purchase.
Our initial goal is to create a summary of new product purchasing from which decompositions can be computed and associated plots can be created.
We will focus on purchasing by those panellists in Market 2. Here are the steps:
Set num_panellists for Market 2 as 1,499
Filter the dataset on the Market column, keeping records where Market equals 2.
Add a day of year (DoY) column which is computed as: (Week of the Year - 1) * 7 + Day of the Week = Day of Year
Create a Depth of Repeat (DoR) level associated with each transaction, where 0 = Trial Purchase, 1 = First Repeat Purchase, 2 = Second Repeat Purchase, and so on. This is done by using Polars’ cum_count() and over() expressions. Note: this operation requires the records to be sorted chronologically for each panellist.
cum_count() computes a running count of rows within each group, starting from 1 for the first appearance of the ID (hence the subtraction of 1 in the code, so that a trial purchase gets DoR = 0).
over() is a window-function expression. It performs calculations across a subset of rows, grouped by one or more columns, without collapsing or aggregating the data at the group level.
over("ID") groups rows by ID and calculates the cumulative count only within each group.
Results are returned in the same row layout, preserving the original structure.
Once we start processing the data further to summarize how many trial, first repeat, etc. purchases occurred in each week, we want all 52 weeks of the year to be available in the summary. However, it turns out that this panel of 1,499 households only purchased the test product in 49 weeks; no purchases occurred in weeks 25, 39, and 41. We therefore need to append the missing weeks to the dataframe constructed so far, with null values in the remaining columns.
We create a pivoted, Transaction DoR by Week and Total Unit Sales DoR by Week copy of the processed dataframe.
# Number of panellists in Market 2
num_panellists = 1499
kiwi_df = (
kiwi_lf
.filter(pl.col('Market') == 2)
.drop('Market')
# DoY -> day of year variable, where DoY = 1 corresponds to the day the new product was launched.
.with_columns(((pl.col('Week') - 1) * 7 + pl.col('Day')).alias('DoY'))
.sort(by=['ID', 'DoY'])
.with_columns(
(pl.col("ID").cum_count().over("ID") - 1).alias("DoR")
)
.collect()
)
kiwi_df
| ID | Week | Day | Units | DoY | DoR |
|---|---|---|---|---|---|
| u16 | i16 | i16 | i16 | i16 | u32 |
| 20001 | 49 | 1 | 1 | 337 | 0 |
| 20002 | 14 | 7 | 1 | 98 | 0 |
| 20003 | 24 | 6 | 1 | 167 | 0 |
| 20004 | 49 | 2 | 1 | 338 | 0 |
| 20005 | 6 | 7 | 1 | 42 | 0 |
| … | … | … | … | … | … |
| 20136 | 26 | 6 | 1 | 181 | 5 |
| 20136 | 37 | 7 | 1 | 259 | 6 |
| 20137 | 13 | 6 | 1 | 90 | 0 |
| 20138 | 18 | 7 | 1 | 126 | 0 |
| 20139 | 49 | 3 | 1 | 339 | 0 |
# The smallest unit of time for our analyses is the day. Ensure that no panellist has more than one transaction record on any day.
tmp_kiwi = (
kiwi_df
    .group_by('ID', 'DoY')  # group_by collapses the data to one row per (ID, DoY) combination
.agg(pl.col('Units').sum())
)
print("1) There aren't multiple transaction records (rows) by the same panellist on the same day:")
print(len(kiwi_df) == len(tmp_kiwi)) # If the comparison returns True, the two dataframes have the same length, i.e. group_by found no multiple transactions by the same ID on the same DoY.
# Did at least one panellist make a purchase of this new product each week?
print("2) Number of weeks where at least one unit of the new product was purchased/sold:")
print(len(kiwi_df['Week'].unique())) # No, there are only 49 weeks of the year where at least one unit of the new product was sold
# Which weeks are missing?
print("3) Weeks with no transactions:")
missing_weeks = list(set(np.arange(1, 53, dtype='int16')) - set(kiwi_df['Week'].unique()))
print(missing_weeks)
1) There aren't multiple transaction records (rows) by the same panellist on the same day:
True
2) Number of weeks where at least one unit of the new product was purchased/sold:
49
3) Weeks with no transactions:
[np.int16(25), np.int16(41), np.int16(39)]
missing_df = pl.DataFrame(
{col: [None]*len(missing_weeks) for col in kiwi_df.columns}
).with_columns(
pl.Series('Week', values = missing_weeks)
)
kiwi_df = kiwi_df.vstack(missing_df)
kiwi_df
| ID | Week | Day | Units | DoY | DoR |
|---|---|---|---|---|---|
| u16 | i16 | i16 | i16 | i16 | u32 |
| 20001 | 49 | 1 | 1 | 337 | 0 |
| 20002 | 14 | 7 | 1 | 98 | 0 |
| 20003 | 24 | 6 | 1 | 167 | 0 |
| 20004 | 49 | 2 | 1 | 338 | 0 |
| 20005 | 6 | 7 | 1 | 42 | 0 |
| … | … | … | … | … | … |
| 20138 | 18 | 7 | 1 | 126 | 0 |
| 20139 | 49 | 3 | 1 | 339 | 0 |
| null | 25 | null | null | null | null |
| null | 41 | null | null | null | null |
| null | 39 | null | null | null | null |
trans_dor_week = (
kiwi_df
.group_by(['Week', 'DoR'])
.agg(pl.len().cast(pl.UInt8).alias('Count')) # Count occurrences for each (Week, DoR) pair
.sort('DoR')
.pivot(index='Week', on='DoR', values='Count') # Pivot the data
.fill_null(0) # Fill any missing values with 0
.sort('Week') # Ensure the rows are sorted by 'Week'
.drop('null')
.with_columns(pl.sum_horizontal(pl.col('*').exclude('Week')).alias('Total')) # Calculate total count per Week
)
trans_dor_week
| Week | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i16 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 | u8 |
| 1 | 8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9 |
| 2 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 |
| 3 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 4 | 16 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18 |
| 5 | 8 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 48 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 5 |
| 49 | 4 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 1 | 0 | 0 | 0 | 8 |
| 50 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 1 | 1 | 1 | 8 |
| 51 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2 |
| 52 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 6 |
qty_dor_week = (
kiwi_df
.group_by(['Week', 'DoR'])
    .agg((pl.col('Units').sum()).alias('Total Units')) # Sum units for each (Week, DoR) pair
.sort('DoR')
.pivot(index='Week', on='DoR', values='Total Units', aggregate_function='sum') # Pivot the data
.fill_null(0) # Fill any missing values with 0
.sort('Week') # Ensure the rows are sorted by 'Week'
.drop('null')
    .with_columns(pl.sum_horizontal(pl.col('*').exclude('Week')).alias('Total')) # Calculate total units per Week
)
qty_dor_week
| Week | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i16 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
| 1 | 9 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 |
| 2 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 |
| 3 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 4 | 19 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 22 |
| 5 | 8 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 48 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 5 |
| 49 | 4 | 0 | 0 | 0 | 0 | 2 | 0 | 2 | 1 | 0 | 0 | 0 | 9 |
| 50 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 2 | 2 | 1 | 11 |
| 51 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 4 |
| 52 | 2 | 1 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 8 |
Now we will plot the trial/repeat decomposition of sales, which breaks weekly sales into its trial and repeat components, in the following manner:
trial_repeat_decomp = (
trans_dor_week[['Week', '0', 'Total']]
.with_columns((pl.col('Total') - pl.col('0')).alias('Repeat'))
.rename({'0': 'Trial'})
.unpivot(index='Week', on=['Trial', 'Repeat'], variable_name='TrialRepeat', value_name='Sales')
)
alt.Chart(trial_repeat_decomp).mark_area().encode(
x=alt.X('Week:O', title='Week', axis=alt.Axis(labelAngle=0, values=np.arange(0, 53, 13), labelExpr="datum.value")),
y=alt.Y('Sales:Q', title='Sales (Transactions)'),
color="TrialRepeat:N"
).properties(
width=650, height=250, title='Trial/Repeat Decomposition of Sales'
).configure_view(stroke=None)
trial_repeat_decomp_stacked = (
trans_dor_week[['Week', '0', 'Total']]
.with_columns((pl.col('Total') - pl.col('0')).alias('Repeat'))
.drop('Total')
.rename({'0': 'Trial'})
.with_columns(pl.sum_horizontal(pl.col('*').exclude('Week')).alias('Total'))
.with_columns(
(pl.col('Trial') / pl.col('Total')).alias('Trial'),
(pl.col('Repeat') / pl.col('Total')).alias('Repeat')
)
.unpivot(index='Week', on=['Trial', 'Repeat'], variable_name='TrialRepeat', value_name='Sales')
.fill_nan(0)
)
alt.Chart(trial_repeat_decomp_stacked).mark_area().encode(
x=alt.X('Week:O', title='Week', axis=alt.Axis(labelAngle=0,values=np.arange(0, 53, 13), labelExpr="datum.value")),
y=alt.Y('Sales:Q', title='Sales (Transactions)', axis=alt.Axis(format=".0%")),
color="TrialRepeat:N"
).properties(
width=650, height=250, title='% Decomposition of Sales'
).configure_view(stroke=None)
trial_repeat_decomp_qty = (
qty_dor_week[['Week', '0', 'Total']]
.with_columns((pl.col('Total') - pl.col('0')).alias('Repeat'))
.rename({'0': 'Trial'})
.unpivot(index='Week', on=['Trial', 'Repeat'], variable_name='TrialRepeat', value_name='Sales')
)
alt.Chart(trial_repeat_decomp_qty).mark_area().encode(
x=alt.X('Week:O', title='Week', axis=alt.Axis(labelAngle=0, values=np.arange(0, 53, 13), labelExpr="datum.value")),
y=alt.Y('Sales:Q', title='Sales (Units)'),
color="TrialRepeat:N"
).properties(
width=650, height=250, title='Trial/Repeat Decomposition of Sales by Total Units'
).configure_view(stroke=None)
Our next plot breaks cumulative sales into its trial, first repeat (FR), and additional repeat (AR) components, in the following manner:
fr_ar_decomp = (
trans_dor_week[['Week', '0', '1', 'Total']]
.with_columns((pl.col('Total') - pl.col('0') - pl.col('1')).alias('AR'))
.rename({'0': 'Trial', '1': 'FR'})
.with_columns(
pl.col('Trial').cum_sum().alias('Trial'),
pl.col('FR').cum_sum().alias('FR'),
pl.col('AR').cum_sum().alias('AR')
)
.unpivot(index='Week', on=['Trial', 'FR', 'AR'], variable_name='TrialRepeat', value_name='Sales')
)
alt.Chart(fr_ar_decomp).mark_area().encode(
x=alt.X('Week:O', title='Week', axis=alt.Axis(labelAngle=0, values=np.arange(0, 53, 13), labelExpr="datum.value")),
y=alt.Y('Sales:Q', title='Sales (Transactions)'),
color="TrialRepeat:N"
).properties(
width=650, height=250, title='Decomposition of Cumulative Sales (Transactions)'
).configure_view(stroke=None)
fr_ar_decomp_units = (
qty_dor_week[['Week', '0', '1', 'Total']]
.with_columns((pl.col('Total') - pl.col('0') - pl.col('1')).alias('AR'))
.rename({'0': 'Trial', '1': 'FR'})
.with_columns(
pl.col('Trial').cum_sum().alias('Trial'),
pl.col('FR').cum_sum().alias('FR'),
pl.col('AR').cum_sum().alias('AR')
)
.unpivot(index='Week', on=['Trial', 'FR', 'AR'], variable_name='TrialRepeat', value_name='Sales')
)
alt.Chart(fr_ar_decomp_units).mark_area().encode(
x=alt.X('Week:O', title='Week', axis=alt.Axis(labelAngle=0, values=np.arange(0, 53, 13), labelExpr="datum.value")),
y=alt.Y('Sales:Q', title='Sales (Units)'),
color="TrialRepeat:N"
).properties(
width=650, height=250, title='Decomposition of Cumulative Sales (by Total Units)'
).configure_view(stroke=None)
We now plot the cumulative growth in the percentage of households (% of panel size) that have made a trial purchase (sometimes called cumulative penetration), in the following manner:
cum_trials = (
fr_ar_decomp
.filter(pl.col('TrialRepeat') == 'Trial')
.with_columns((pl.col('Sales') / num_panellists).alias('pct_cum_total'))
.drop('TrialRepeat','Sales')
)
alt.Chart(cum_trials).mark_line().encode(
x=alt.X('Week:O', title='Week', axis=alt.Axis(labelAngle=0, values=np.arange(0, 53, 4), labelExpr="datum.value")),
y=alt.Y('pct_cum_total:Q', title='% of Households', axis=alt.Axis(format='.0%')),
).properties(
width=650, height=250, title='Kiwi Bubbles Cumulative Trials'
).configure_view(stroke=None)
We now plot the evolution of the percentage of panellists that made a trial purchase that have gone on to make a (first) repeat purchase, in the following manner:
pct_triers_repeating = (
fr_ar_decomp
.filter((pl.col('TrialRepeat') == 'Trial') | (pl.col('TrialRepeat') == 'FR'))
.pivot(on='TrialRepeat', index='Week', values='Sales')
.with_columns((pl.col('FR') / pl.col('Trial')).alias('pct_triers_rpting'))
.drop('Trial','FR')
)
alt.Chart(pct_triers_repeating).mark_line().encode(
x=alt.X('Week:O', title='Week', axis=alt.Axis(labelAngle=0, values=np.arange(0, 53, 4), labelExpr="datum.value")),
y=alt.Y('pct_triers_rpting:Q', title='% of Triers', axis=alt.Axis(format='.0%')),
).properties(
width=650, height=250, title='% of Triers Repeating'
).configure_view(stroke=None)
Looking at the % of triers repeating plot from earlier, we see that the appeal of the new product is such that just over 37% of those customers that made a trial purchase in its first year on the market went on to make a repeat purchase of the new product in that year. Note that these % triers repeating numbers are plotted in calendar time (or, more precisely, time since the launch of the product). We can compute a related measure that tells us how many weeks after their trial purchase a panellist makes their (first) repeat purchase.
The goal of the code is to compute and visualize the cumulative percentage of triers (consumers) who made their first repeat purchase a certain number of weeks after their trial purchase. The analysis includes:
1. Identifying trial weeks (Trial Week) and the time difference (FR Delta) between the trial purchase and the first repeat purchase.
2. Aggregating the data to determine the counts of first repeats per trial week and calculating cumulative first repeats as a percentage of each trial-week cohort’s size.
3. Visualizing these cumulative percentages for triers making their first repeat purchase.
As a first step, we create a summary that tells us the cumulative percentage of triers that have made a (first) repeat purchase so many weeks after their trial purchase, broken down by time of trial.
Step 1: Trial Week and First Repeat Delta (FR Delta) Computation
We start by creating a “week of trial purchase” variable (Trial Week) and a variable that counts the number of weeks between a panellist’s trial and first repeat purchase (FR Delta). The Trial Week variable equals the value of Week if the transaction is the panellist’s trial purchase (DoR = 0) and None otherwise.
\[ \text{Trial Week} = \begin{cases} \text{Week}, & \text{if DoR} = 0 \\ \text{None}, & \text{otherwise}. \end{cases} \]
The FR Delta variable tells us how many weeks after the trial purchase the panellist’s first repeat purchase occurred (assuming it was observed). This is computed using the shift method as follows:
- Shift the DoR and Week columns by one row to get access to the next row’s values.
- Use a pl.when condition to compute FR Delta based on whether the next row’s DoR is 1.
- If DoR in the next row is 1, compute the difference between the next row’s Week and the current row’s Trial Week. Otherwise, assign None.
- None indicates that a first repeat purchase was not observed for this panellist in the 52-week observation period. Strictly speaking, FR Delta = 0 if the first repeat purchase occurs in the same calendar week as the trial purchase, 1 if it occurs in the calendar week immediately after that in which the trial purchase occurred, etc.

\[ \text{FR Delta} = \begin{cases} \text{Next\_Week} - \text{Trial Week}, & \text{if Next\_DoR} = 1 \\ \text{None}, & \text{otherwise}. \end{cases} \]
Step 2: Aggregate Counts of First Repeats & Total Triers
Next, we aggregate the number of IDs (Count) for each combination of Trial Week and FR Delta. Rows where Trial Week and FR Delta are both None are excluded.
Additionally, we aggregate the total number of triers in each Trial Week class (the sum of ID counts for each Trial Week class). This aggregation includes both triers that made a repeat purchase and triers that did not. As expected, not all triers made a repeat purchase. For example, we see that eight panellists made a trial purchase during the first week Kiwi Bubbles was on the market. One panellist purchased the product again that same week. Another made their first repeat purchase two weeks after their trial. Summing the counts of week-1 triers who made their first repeat purchase within the 52-week observation period, we see that five of the eight week-1 triers (62.5%) had made a first repeat purchase.
Step 3: Include Missing Combinations & Perform Cumulative Operations
We cannot assume that all the trial and “time from trial to FR” weeks are observed in the dataset, so we need to fill in the missing values.
We create a grid of all combinations of Trial Week (1–52) and FR Delta (0–51). This ensures all potential weeks are included, even if they are absent from the observed data. This grid is then joined with:
- agg_trial_fr: aggregated counts of first repeat purchases grouped by Trial Week and FR Delta.
- trial_week_total: total triers in each Trial Week.
We then compute the cumulative percentage of triers that have made a first repeat purchase so many weeks after their trial purchase.
First, we calculate cumulative counts of first repeat purchases (Cum FR by Week) within each Trial Week class.
Second, we compute the cumulative percentages of first repeats (Cum FR by Trial):
\[ \text{Cum FR by Trial} = \begin{cases} \frac{\text{Cum FR by Week}}{\text{Total Triers}}, & \text{if Total Triers} > 0 \\ 0, & \text{otherwise}. \end{cases} \]
If the number of remaining weeks after the trial is less than the FR Delta (e.g., Week 52 triers cannot have an FR Delta greater than 0), we assign None to such rows. The cumulative percentage values (Cum FR by Trial) are weighted by the number of triers (Total Triers) to compute a weighted average for visualization.
\[ \text{Weights}(\text{Trial Week}, \text{FR Delta}) = \text{Cum FR by Trial} \times \text{Total Triers} \]
This weighting ensures that larger trial cohorts contribute proportionally more to the overall average, avoiding bias from small sample sizes.
These weights are later used to calculate the weighted cumulative percentage of first repeat purchases across all trial weeks.
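The effect of the weighting can be checked with a small numeric sketch (made-up cohort sizes, not the panel’s):

```python
# Two hypothetical trial cohorts observed at the same FR Delta:
# cohort A: 8 triers, 50% have repeated; cohort B: 2 triers, 100% have repeated
cohorts = [(0.50, 8), (1.00, 2)]

weights_sum = sum(pct * n for pct, n in cohorts)  # number of repeaters: 4 + 2 = 6
triers_sum = sum(n for _, n in cohorts)           # total triers: 10
weighted_avg = weights_sum / triers_sum           # 6 / 10 = 0.6

# A simple unweighted mean of the two percentages would give 0.75,
# letting the tiny cohort B pull the average upward
print(weighted_avg)
```

The weighted average recovers the overall fraction of triers who have repeated, whereas an unweighted mean of cohort percentages would overstate the influence of small cohorts.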
Step 4: Filter and Compute Time to First Repeat Curve
This step takes the cumulative dataset (cum_fr_by_trial), filters it to focus on valid ranges for visualization, and computes the weighted average of cumulative first repeat percentages across trial weeks.
Filter Data to Valid Range
We restrict the data to trial purchases made in the first half of the year (Trial Week <= 26) and to first repeat purchases made up to 26 weeks after the trial (FR Delta <= 26). Only triers up to Trial Week 26 have a full 26 weeks in which to make their first repeat purchase; including later cohorts would be misleading since no data is observed for them beyond 26 weeks.

Aggregate Across Trial Weeks
The filtered data is grouped by FR Delta (weeks after trial purchase) to aggregate:
- Total Triers: the total number of trial purchases within the valid range.
- Weights: the summed weights for cumulative first repeat percentages across all Trial Week cohorts.

\[ \text{Weights (Summed)} = \sum_{\text{Trial Week}} \left( \text{Cum FR by Trial} \times \text{Total Triers} \right) \]
Compute Weighted Time to First Repeat
Using the summed Weights and Total Triers, we compute the weighted average of cumulative first repeat percentages:

\[ \text{Time to FR} = \frac{\text{Weights (Summed)}}{\text{Total Triers (Summed)}} \]

In this way, the cumulative percentage at each FR Delta is weighted by the size of the trial cohort (Total Triers) contributing to that value.

kiwi_trial_fr = (
kiwi_lf
.filter(pl.col('Market') == 2)
.drop('Market')
# DoY -> day of year variable, where the 1 corresponds to the day the new product was launched.
.with_columns(((pl.col('Week') - 1) * 7 + pl.col('Day')).alias('DoY'))
.sort(by=['ID', 'DoY']) # Sorting is necessary to organize the data so that DoR can be computed
.with_columns((pl.col("ID").cum_count().over("ID") - 1).alias("DoR"))
.with_columns(
# Trial Week = Week when DoR == 0, else None
pl.when(pl.col('DoR') == 0).then(pl.col('Week')).otherwise(None).alias('Trial Week'),
# Shift DoR and Week to get the next row's values
pl.col('DoR').shift(-1).alias('Next_DoR'),
pl.col('Week').shift(-1).alias('Next_Week')
).with_columns(
# Compute FR Delta based on next row's DoR and Week
pl.when(pl.col('Next_DoR') == 1)
.then(pl.col('Next_Week') - pl.col('Trial Week'))
.otherwise(None)
.alias('FR Delta')
).select('ID', 'Trial Week', 'FR Delta') # Drop unused & temporary columns
)
# Count of panellists for each combination of Trial Week and FR Delta
agg_trial_fr = (
kiwi_trial_fr
.group_by(['Trial Week', 'FR Delta'])
.agg(pl.len().alias('Count'))
.filter(
~pl.all_horizontal(pl.col('*').exclude('Count').is_null()) # Drop rows if 'Trial Week' and 'FR Delta' columns are null
)
)
# Sum of triers in each Trial Week class
trial_week_total = (
agg_trial_fr
.group_by('Trial Week')
.agg(pl.col('Count').sum().alias('Total Triers'))
.sort('Trial Week')
.fill_null(0)
)
# Dummy dataset of all combinations of Trial Weeks and FR Delta within the period of analysis - handle missing combinations
trial_week_range, fr_delta_range = np.meshgrid(np.arange(1, 53, dtype='int16'), np.arange(0, 52, dtype='int16'))
trial_fr_range = pl.LazyFrame({'Trial Week': trial_week_range.reshape(-1), 'FR Delta': fr_delta_range.reshape(-1)})
# Join dummy dataset with aggregated
cum_fr_by_trial = (
trial_fr_range
.join(trial_week_total, on='Trial Week', how='left')
.join(agg_trial_fr, on=['Trial Week', 'FR Delta'], how='left')
.fill_null(0)
.with_columns(pl.col('Count').cum_sum().over('Trial Week').alias('Cum FR by Week')) # calculate cumulative counts of first repeat purchases within each Trial Week class
.with_columns(
pl.when(pl.col('Trial Week') > (52 - pl.col('FR Delta')))
.then(None) # If the number of remaining weeks after the trial is less than the `FR Delta` (e.g., Week 52 triers cannot have an `FR Delta` greater than 0), assign `None` to such rows.
.otherwise(
pl.when(pl.col('Total Triers') > 0)
.then(pl.col('Cum FR by Week') / pl.col('Total Triers')) # compute the cumulative percentages of first repeats
.otherwise(0)
)
.alias('Cum FR by Trial')
)
.with_columns(
(pl.col('Cum FR by Trial') * pl.col('Total Triers')).alias('Weights') # Compute weights
)
)
time_to_fr_filtered = (
cum_fr_by_trial
.filter(
(pl.col('Trial Week') <= 26) &
(pl.col('FR Delta') <= 26)
)
.group_by('FR Delta')
.agg(
pl.col('Total Triers').sum(),
pl.col('Weights').sum()
)
.with_columns((pl.col('Weights') / pl.col('Total Triers')).alias('Time to FR')) # compute the weighted average of cumulative first repeat percentages across trial weeks
.sort('FR Delta')
.collect()
)
time_to_fr_filtered
| FR Delta | Total Triers | Weights | Time to FR |
|---|---|---|---|
| i16 | u32 | f64 | f64 |
| 0 | 101 | 4.0 | 0.039604 |
| 1 | 101 | 10.0 | 0.09901 |
| 2 | 101 | 16.0 | 0.158416 |
| 3 | 101 | 18.0 | 0.178218 |
| 4 | 101 | 21.0 | 0.207921 |
| … | … | … | … |
| 22 | 101 | 40.0 | 0.39604 |
| 23 | 101 | 40.0 | 0.39604 |
| 24 | 101 | 41.0 | 0.405941 |
| 25 | 101 | 41.0 | 0.405941 |
| 26 | 101 | 42.0 | 0.415842 |
alt.Chart(time_to_fr_filtered).mark_line().encode(
x=alt.X('FR Delta:O', title='Weeks After Trial', axis=alt.Axis(labelAngle=0, values=np.arange(0, 53, 2), labelExpr="datum.value")),
y=alt.Y('Time to FR:Q', title='% of Triers', axis=alt.Axis(format='.0%')),
).properties(
width=650, height=250, title='Time to First Repeat - Empirical distribution of the time from trial to first repeat purchase'
).configure_view(stroke=None)
We see that 42% of triers have made a first repeat purchase within a year of their trial purchase. This is higher than the 37% of triers that had made a repeat purchase by the end of week 52 reported in the % of triers repeating plot from earlier. Why is this the case? The earlier plot considered all triers in the first year, whereas the plot in this section only considers triers up to the 26th week. By restricting attention to the smaller trier base acquired earlier in the year, we allow sufficient time for those triers to make their first repeat purchase. Because the earlier plot includes triers right up to the last week of the observation period, those late triers are not given sufficient time to make a repeat purchase.
kiwi_fr_2r = (
kiwi_lf
.filter(pl.col('Market') == 2)
.drop('Market')
.with_columns(((pl.col('Week') - 1) * 7 + pl.col('Day')).alias('DoY'))
.sort(by=['ID', 'DoY'])
.with_columns((pl.col("ID").cum_count().over("ID") - 1).alias("DoR"))
.with_columns(
pl.when(pl.col('DoR') == 1).then(pl.col('Week')).otherwise(None).alias('FR Week'), # FR Week = Week when DoR == 1, else None
pl.col('DoR').shift(-1).alias('Next_DoR'),
pl.col('Week').shift(-1).alias('Next_Week')
).with_columns(
pl.when(pl.col('Next_DoR') == 2) # Compute 2R Delta based on next row's DoR and Week
.then(pl.col('Next_Week') - pl.col('FR Week'))
.otherwise(None)
.alias('2R Delta')
).select('ID', 'FR Week', '2R Delta')
)
agg_fr_2r = (
kiwi_fr_2r
.group_by(['FR Week', '2R Delta'])
.agg(pl.len().alias('Count'))
.filter(
~pl.all_horizontal(pl.col('*').exclude('Count').is_null())
)
)
fr_week_total = (
agg_fr_2r
.group_by('FR Week')
.agg(pl.col('Count').sum().alias('Total FRs'))
.sort('FR Week')
.fill_null(0)
)
fr_week_range, delta_2r_range = np.meshgrid(np.arange(1, 53, dtype='int16'), np.arange(0, 52, dtype='int16'))
fr_2r_range = pl.LazyFrame({'FR Week': fr_week_range.reshape(-1), '2R Delta': delta_2r_range.reshape(-1)})
cum_2r_by_trial = (
fr_2r_range
.join(fr_week_total, on='FR Week', how='left')
.join(agg_fr_2r, on=['FR Week', '2R Delta'], how='left')
.fill_null(0)
.with_columns(pl.col('Count').cum_sum().over('FR Week').alias('Cum 2R by Week'))
.with_columns(
pl.when(pl.col('FR Week') > (52 - pl.col('2R Delta')))
.then(None)
.otherwise(
pl.when(pl.col('Total FRs') > 0)
.then(pl.col('Cum 2R by Week') / pl.col('Total FRs'))
.otherwise(0)
)
.alias('Cum 2R by FR')
)
.with_columns(
(pl.col('Cum 2R by FR') * pl.col('Total FRs')).alias('Weights')
)
)
time_to_2r_filtered = (
cum_2r_by_trial
.filter(
(pl.col('FR Week') <= 26) &
(pl.col('2R Delta') <= 26)
)
.group_by('2R Delta')
.agg(
pl.col('Total FRs').sum(),
pl.col('Weights').sum()
)
.with_columns((pl.col('Weights') / pl.col('Total FRs')).alias('Time to 2R'))
.sort('2R Delta')
.collect()
)
time_to_2r_filtered
| 2R Delta | Total FRs | Weights | Time to 2R |
|---|---|---|---|
| i16 | u32 | f64 | f64 |
| 0 | 35 | 1.0 | 0.028571 |
| 1 | 35 | 4.0 | 0.114286 |
| 2 | 35 | 9.0 | 0.257143 |
| 3 | 35 | 11.0 | 0.314286 |
| 4 | 35 | 13.0 | 0.371429 |
| … | … | … | … |
| 22 | 35 | 25.0 | 0.714286 |
| 23 | 35 | 25.0 | 0.714286 |
| 24 | 35 | 25.0 | 0.714286 |
| 25 | 35 | 25.0 | 0.714286 |
| 26 | 35 | 25.0 | 0.714286 |
alt.Chart(time_to_2r_filtered).mark_line().encode(
x=alt.X("2R Delta:O", title='Weeks After First Repeat', axis=alt.Axis(labelAngle=0, values=np.arange(0, 53, 2), labelExpr="datum.value")),
y=alt.Y('Time to 2R:Q', title='% of First Repeaters', axis=alt.Axis(format='.0%')),
).properties(
width=650, height=250, title='Time to Second Repeat - Empirical distribution of the time from first repeat to second repeat purchase'
).configure_view(stroke=None)